From: Pavel Begunkov <[email protected]>
To: Ming Lei <[email protected]>, Kevin Wolf <[email protected]>
Cc: Jens Axboe <[email protected]>,
[email protected], [email protected]
Subject: Re: [PATCH 5/9] io_uring: support SQE group
Date: Mon, 29 Apr 2024 16:48:37 +0100 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <Zihi3nDAJg1s7Cws@fedora>
On 4/24/24 02:39, Ming Lei wrote:
> On Tue, Apr 23, 2024 at 03:08:55PM +0200, Kevin Wolf wrote:
>> Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben:
>>> On 4/7/24 7:03 PM, Ming Lei wrote:
>>>> SQE group is defined as one chain of SQEs starting with the first sqe that
>>>> has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
>>>> doesn't have it set, and it is similar with chain of linked sqes.
>>>>
>>>> The 1st SQE is group leader, and the other SQEs are group member. The group
>>>> leader is always freed after all members are completed. Group members
>>>> aren't submitted until the group leader is completed, and there isn't any
>>>> dependency among group members, and IOSQE_IO_LINK can't be set for group
>>>> members, same with IOSQE_IO_DRAIN.
>>>>
>>>> Typically the group leader provides or makes resource, and the other members
>>>> consume the resource, such as scenario of multiple backup, the 1st SQE is to
>>>> read data from source file into fixed buffer, the other SQEs write data from
>>>> the same buffer into other destination files. SQE group provides very
>>>> efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
>>>> submitted in single syscall, no need to submit fs read SQE first, and wait
>>>> until read SQE is completed, 2) no need to link all write SQEs together, then
>>>> write SQEs can be submitted to files concurrently. Meantime application is
>>>> simplified a lot in this way.
>>>>
>>>> Another use case is to for supporting generic device zero copy:
>>>>
>>>> - the lead SQE is for providing device buffer, which is owned by device or
>>>> kernel, can't be cross userspace, otherwise easy to cause leak for devil
>>>> application or panic
>>>>
>>>> - member SQEs reads or writes concurrently against the buffer provided by lead
>>>> SQE
>>>
>>> In concept, this looks very similar to "sqe bundles" that I played with
>>> in the past:
>>>
>>> https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle
>>>
>>> Didn't look too closely yet at the implementation, but in spirit it's
>>> about the same in that the first entry is processed first, and there's
>>> no ordering implied between the test of the members of the bundle /
>>> group.
>>
>> When I first read this patch, I wondered if it wouldn't make sense to
>> allow linking a group with subsequent requests, e.g. first having a few
>> requests that run in parallel and once all of them have completed
>> continue with the next linked one sequentially.
>>
>> For SQE bundles, you reused the LINK flag, which doesn't easily allow
>> this. Ming's patch uses a new flag for groups, so the interface would be
>> more obvious, you simply set the LINK flag on the last member of the
>> group (or on the leader, doesn't really matter). Of course, this doesn't
>> mean it has to be implemented now, but there is a clear way forward if
>> it's wanted.
>
> Reusing LINK for bundle breaks existed link chains(BUNDLE linked to existed
> link chain), so I think it may not work.
>
> The link rule is explicit for sqe group:
>
> - only group leader can set link flag, which is applied on the whole
> group: the next sqe in the link chain won't be started until the
> previous linked sqe group is completed
>
> - link flag can't be set for group members
>
> Also sqe group doesn't limit async for both group leader and member.
>
> sqe group vs link & async is covered in the last liburing test code.
>
>>
>> The part that looks a bit arbitrary in Ming's patch is that the group
>> leader is always completed before the rest starts. It makes perfect
>> sense in the context that this series is really after (enabling zero
>> copy for ublk), but it doesn't really allow the case you mention in the
>> SQE bundle commit message, running everything in parallel and getting a
>> single CQE for the whole group.
>
> I think it should be easy to cover bundle in this way, such as add one new
> op IORING_OP_BUNDLE as Jens did, and implement the single CQE for whole group/bundle.
>
>>
>> I suppose you could hack around the sequential nature of the first
>> request by using an extra NOP as the group leader - which isn't any
>> worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but
>> the group completion would still be missing. (Of course, removing the
>> sequential first operation would mean that ublk wouldn't have the buffer
>> ready any more when the other requests try to use it, so that would
>> defeat the purpose of the series...)
>>
>> I wonder if we can still combine both approaches and create some
>> generally useful infrastructure and not something where it's visible
>> that it was designed mostly for ublk's special case and other use cases
>> just happened to be enabled as a side effect.
>
> sqe group is actually one generic interface, please see the multiple copy(
> copy one file to multiple destinations in single syscall for one range) example
> in the last patch, and it can support generic device zero copy: any device internal
> buffer can be linked with io_uring operations in this way, which can't
> be done by traditional splice/pipe.
>
> I guess it can be used in network Rx zero copy too, but may depend on actual
> network Rx use case.
I doubt. With storage same data can be read twice. Socket recv consumes
data. Locking a buffer over the duration of another IO doesn't really sound
plausible, same we returning a buffer back. It'd be different if you can
read the buffer into the userspace if something goes wrong, but perhaps
you remember the fused discussion.
--
Pavel Begunkov
next prev parent reply other threads:[~2024-04-29 15:48 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-04-08 1:03 [RFC PATCH 0/9] io_uring: support sqe group and provide group kbuf Ming Lei
2024-04-08 1:03 ` [PATCH 1/9] io_uring: net: don't check sqe->__pad2[0] for send zc Ming Lei
2024-04-08 1:03 ` [PATCH 2/9] io_uring: support user sqe ext flags Ming Lei
2024-04-22 18:16 ` Jens Axboe
2024-04-23 13:57 ` Ming Lei
2024-04-29 15:24 ` Pavel Begunkov
2024-04-30 3:43 ` Ming Lei
2024-04-30 12:00 ` Pavel Begunkov
2024-04-30 12:56 ` Ming Lei
2024-04-30 14:10 ` Pavel Begunkov
2024-04-30 15:46 ` Ming Lei
2024-05-02 14:22 ` Pavel Begunkov
2024-05-04 1:19 ` Ming Lei
2024-04-08 1:03 ` [PATCH 3/9] io_uring: add helper for filling cqes in __io_submit_flush_completions() Ming Lei
2024-04-08 1:03 ` [PATCH 4/9] io_uring: add one output argument to io_submit_sqe Ming Lei
2024-04-08 1:03 ` [PATCH 5/9] io_uring: support SQE group Ming Lei
2024-04-22 18:27 ` Jens Axboe
2024-04-23 13:08 ` Kevin Wolf
2024-04-24 1:39 ` Ming Lei
2024-04-25 9:27 ` Kevin Wolf
2024-04-26 7:53 ` Ming Lei
2024-04-26 17:05 ` Kevin Wolf
2024-04-29 3:34 ` Ming Lei
2024-04-29 15:48 ` Pavel Begunkov [this message]
2024-04-30 3:07 ` Ming Lei
2024-04-29 15:32 ` Pavel Begunkov
2024-04-30 3:03 ` Ming Lei
2024-04-30 12:27 ` Pavel Begunkov
2024-04-30 15:00 ` Ming Lei
2024-05-02 14:09 ` Pavel Begunkov
2024-05-04 1:56 ` Ming Lei
2024-05-02 14:28 ` Pavel Begunkov
2024-04-24 0:46 ` Ming Lei
2024-04-08 1:03 ` [PATCH 6/9] io_uring: support providing sqe group buffer Ming Lei
2024-04-08 1:03 ` [PATCH 7/9] io_uring/uring_cmd: support provide group kernel buffer Ming Lei
2024-04-08 1:03 ` [PATCH 8/9] ublk: support provide io buffer Ming Lei
2024-04-08 1:03 ` [RFC PATCH 9/9] liburing: support sqe ext_flags & sqe group Ming Lei
2024-04-19 0:55 ` [RFC PATCH 0/9] io_uring: support sqe group and provide group kbuf Ming Lei
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox