Re: [PATCH 5/9] io_uring: support SQE group

public inbox for io-uring@vger.kernel.org

From: Ming Lei <ming.lei@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	io-uring@vger.kernel.org, linux-block@vger.kernel.org,
	Pavel Begunkov <asml.silence@gmail.com>,
	ming.lei@redhat.com
Subject: Re: [PATCH 5/9] io_uring: support SQE group
Date: Wed, 24 Apr 2024 09:39:42 +0800
Message-ID: <Zihi3nDAJg1s7Cws@fedora>
In-Reply-To: <Ziey53aADgxDrXZw@redhat.com>

On Tue, Apr 23, 2024 at 03:08:55PM +0200, Kevin Wolf wrote:
> On 22.04.2024 at 20:27, Jens Axboe wrote:
> > On 4/7/24 7:03 PM, Ming Lei wrote:
> > > An SQE group is defined as one chain of SQEs starting with the first sqe
> > > that has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe
> > > that doesn't have it set. It is similar to a chain of linked sqes.
> > > 
> > > The 1st SQE is the group leader, and the other SQEs are group members. The
> > > group leader is always freed after all members are completed. Group members
> > > aren't submitted until the group leader is completed, and there isn't any
> > > dependency among group members. IOSQE_IO_LINK can't be set for group
> > > members, and neither can IOSQE_IO_DRAIN.
> > > 
> > > Typically the group leader provides or produces a resource, and the other
> > > members consume it. In a multiple-backup scenario, for example, the 1st SQE
> > > reads data from the source file into a fixed buffer, and the other SQEs write
> > > data from the same buffer into the destination files. SQE group provides a
> > > very efficient way to complete this task: 1) the fs write SQEs and the fs
> > > read SQE can be submitted in a single syscall, with no need to submit the
> > > read SQE first and wait until it completes; 2) there is no need to link all
> > > write SQEs together, so the write SQEs can be submitted to the files
> > > concurrently. The application is also simplified a lot this way.
> > > 
> > > Another use case is supporting generic device zero copy:
> > > 
> > > - the lead SQE provides a device buffer, which is owned by the device or
> > >   the kernel and can't be exposed to userspace; otherwise a malicious
> > >   application could easily cause a leak or a panic
> > > 
> > > - member SQEs read or write concurrently against the buffer provided by
> > >   the lead SQE
> > 
> > In concept, this looks very similar to "sqe bundles" that I played with
> > in the past:
> > 
> > https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle
> > 
> > Didn't look too closely yet at the implementation, but in spirit it's
> > about the same in that the first entry is processed first, and there's
> > no ordering implied between the rest of the members of the bundle /
> > group.
> 
> When I first read this patch, I wondered if it wouldn't make sense to
> allow linking a group with subsequent requests, e.g. first having a few
> requests that run in parallel and once all of them have completed
> continue with the next linked one sequentially.
> 
> For SQE bundles, you reused the LINK flag, which doesn't easily allow
> this. Ming's patch uses a new flag for groups, so the interface would be
> more obvious, you simply set the LINK flag on the last member of the
> group (or on the leader, doesn't really matter). Of course, this doesn't
> mean it has to be implemented now, but there is a clear way forward if
> it's wanted.

Reusing LINK for bundles breaks existing link chains (a BUNDLE linked to an
existing link chain), so I think it may not work.

The link rule is explicit for sqe group:

- only the group leader can set the link flag, which then applies to the
  whole group: the next sqe in the link chain won't be started until the
  previous linked sqe group is completed

- the link flag can't be set for group members
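
As a rough illustration of the rules above, here is a small standalone
sketch that finds the extent of a group in an array of sqe flags and rejects
LINK or DRAIN on a member. It is not code from the series: the flag names
and bit values are placeholders, group_len() is a hypothetical helper, and
it assumes the group includes the first sqe without the group flag (as a
link chain does). Treating a group that runs off the end of the array as
complete is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder flag bits for illustration only; not the kernel's values. */
enum {
	F_LINK  = 1u << 0,	/* stands in for IOSQE_IO_LINK */
	F_DRAIN = 1u << 1,	/* stands in for IOSQE_IO_DRAIN */
	F_GROUP = 1u << 2,	/* stands in for IOSQE_EXT_SQE_GROUP */
};

/*
 * Return the number of sqes in the group whose leader is at index `start`
 * (1 for a plain sqe), or -1 if a group member carries LINK or DRAIN.
 * The leader itself is allowed to carry LINK.
 */
static int group_len(const unsigned *flags, size_t n, size_t start)
{
	size_t i;

	assert(start < n);

	if (!(flags[start] & F_GROUP))
		return 1;	/* plain sqe, not a group leader */

	/* Members run up to and including the first sqe without F_GROUP. */
	for (i = start + 1; i < n; i++) {
		if (flags[i] & (F_LINK | F_DRAIN))
			return -1;	/* members may not be linked or drained */
		if (!(flags[i] & F_GROUP))
			return (int)(i - start + 1);
	}

	/* Assumption: an unterminated group ends with the array. */
	return (int)(n - start);
}
```

With flags { F_GROUP | F_LINK, F_GROUP, 0, 0 }, the first three entries form
one group whose leader links the whole group to the fourth sqe.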

Also, sqe group doesn't restrict async for either the group leader or the
members.

The interaction of sqe group with link and async is covered by the liburing
test code in the last patch.

> 
> The part that looks a bit arbitrary in Ming's patch is that the group
> leader is always completed before the rest starts. It makes perfect
> sense in the context that this series is really after (enabling zero
> copy for ublk), but it doesn't really allow the case you mention in the
> SQE bundle commit message, running everything in parallel and getting a
> single CQE for the whole group.

I think it should be easy to cover bundles this way, for example by adding a
new op IORING_OP_BUNDLE as Jens did, and implementing a single CQE for the
whole group/bundle.

> 
> I suppose you could hack around the sequential nature of the first
> request by using an extra NOP as the group leader - which isn't any
> worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but
> the group completion would still be missing. (Of course, removing the
> sequential first operation would mean that ublk wouldn't have the buffer
> ready any more when the other requests try to use it, so that would
> defeat the purpose of the series...)
> 
> I wonder if we can still combine both approaches and create some
> generally useful infrastructure and not something where it's visible
> that it was designed mostly for ublk's special case and other use cases
> just happened to be enabled as a side effect.

sqe group is actually a generic interface; please see the multiple-copy
example in the last patch (copying one range of a file to multiple
destinations in a single syscall). It can also support generic device zero
copy: any device-internal buffer can be linked with io_uring operations this
way, which can't be done with traditional splice/pipe.
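
To make the multiple-copy layout concrete, here is a minimal standalone
sketch of how the submission entries could be arranged: one read as group
leader, then one write per destination. It does not use liburing or the
real sqe layout; struct demo_sqe, the DEMO_* names and
demo_fill_copy_group() are all hypothetical, and it assumes the last member
is the one without the group flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real opcodes and flag bits are in the series. */
enum demo_op { DEMO_READ, DEMO_WRITE };
#define DEMO_F_GROUP 0x1u	/* stands in for IOSQE_EXT_SQE_GROUP */

struct demo_sqe {
	enum demo_op op;
	int fd;
	unsigned flags;
};

/*
 * Lay out one read from src_fd (the group leader) followed by one write
 * per destination (the group members). Every entry except the last carries
 * the group flag, so the last write terminates the group.
 * `sqes` must have room for n_dst + 1 entries; returns the number filled.
 */
static size_t demo_fill_copy_group(struct demo_sqe *sqes, int src_fd,
				   const int *dst_fds, size_t n_dst)
{
	size_t i;

	assert(n_dst >= 1);

	sqes[0] = (struct demo_sqe){ DEMO_READ, src_fd, DEMO_F_GROUP };
	for (i = 0; i < n_dst; i++) {
		/* Only the final write omits the flag, closing the group. */
		unsigned flags = (i + 1 < n_dst) ? DEMO_F_GROUP : 0;

		sqes[i + 1] = (struct demo_sqe){ DEMO_WRITE, dst_fds[i], flags };
	}
	return n_dst + 1;
}
```

All entries would then go to the kernel in one submission; the writes are
not linked to each other, so they can run concurrently once the read has
completed.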

I guess it can be used for network Rx zero copy too, but that may depend on
the actual network Rx use case.



Thanks,
Ming
