From: Ming Lei <[email protected]>
To: Jens Axboe <[email protected]>
Cc: Pavel Begunkov <[email protected]>,
[email protected], [email protected],
Uday Shankar <[email protected]>,
Akilesh Kailash <[email protected]>,
[email protected]
Subject: Re: [PATCH V8 0/8] io_uring: support sqe group and leased group kbuf
Date: Thu, 31 Oct 2024 10:53:25 +0800
Message-ID: <ZyLxJdn7bboZMAcs@fedora>
In-Reply-To: <[email protected]>
On Wed, Oct 30, 2024 at 07:20:48AM -0600, Jens Axboe wrote:
> On 10/29/24 10:11 PM, Ming Lei wrote:
> > On Wed, Oct 30, 2024 at 11:08:16AM +0800, Ming Lei wrote:
> >> On Tue, Oct 29, 2024 at 08:43:39PM -0600, Jens Axboe wrote:
> >
> > ...
> >
> >>> You could avoid the OP dependency with just a flag, if you really wanted
> >>> to. But I'm not sure it makes a lot of sense. And it's a hell of a lot
> >>
> >> Yes, IO_LINK won't work for submitting multiple IOs concurrently, an
> >> extra syscall makes the application too complicated, and IO latency is
> >> increased.
> >>
> >>> simpler than the sqe group scheme, which I'm a bit worried about as it's
> >>> a bit complicated in how deep it needs to go in the code. This one
> >>> stands alone, so I'd strongly encourage we pursue this a bit further and
> >>> iron out the kinks. Maybe it won't work in the end, I don't know, but it
> >>> seems pretty promising and it's soooo much simpler.
> >>
> >> If buffer registration and lookup are always done in ->prep(), the OP
> >> dependency may be avoided.
> >
> > Even if all buffer registration and lookup are done in ->prep(), OP
> > dependency still can't be avoided completely, such as:
> >
> > 1) two local buffers for sending to two sockets
> >
> > 2) group 1: IORING_OP_LOCAL_KBUF1 & [send(sock1), send(sock2)]
> >
> > 3) group 2: IORING_OP_LOCAL_KBUF2 & [send(sock1), send(sock2)]
> >
> > group 1 and group 2 need to be linked, but inside each group, the two
> > sends may be submitted in parallel.
>
> That is where groups of course work, in that you can submit 2 groups and
> have each member inside each group run independently. But I do think we
> need to decouple the local buffer and group concepts entirely. For the
> first step, getting local buffers working with zero copy would be ideal,
> and then just live with the fact that group 1 needs to be submitted
> first and group 2 once the first ones are done.
IMHO, this is a _kernel_ zero-copy (_performance_) feature, which usually
implies:

- an expectation of better performance
- no big change to existing applications for adopting the feature

Application developers are much less interested in a crippled or immature
feature, especially one that requires big changes to existing code logic
(leaving two code paths to maintain) and risks performance regressions.
With sqe group and REQ_F_GROUP_KBUF, an application needs only a few lines
of change to use the feature, and the feature is easy to evaluate since no
extra logic change and no extra syscall/wait are introduced. The whole
patchset is mature at this point, but unfortunately it has been blocked
without an obvious reason.
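
To make "a few lines of change" concrete, here is a hypothetical
liburing-style sketch of the two-group example quoted above. The
IORING_OP_LOCAL_KBUF1/2 opcodes are just the placeholders from this thread,
and the IOSQE_SQE_GROUP flag semantics follow this series (a group starts at
the first SQE with the flag set and ends at the first subsequent SQE without
it); none of this is merged UAPI, so treat it purely as a sketch:

#include <liburing.h>

static void queue_kbuf_group(struct io_uring *ring, int kbuf_op,
			     int sock1, int sock2, unsigned int len,
			     int link_next_group)
{
	struct io_uring_sqe *sqe;

	/* group leader: lease the kernel buffer to the whole group
	 * (error checks on io_uring_get_sqe() omitted for brevity)
	 */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_rw(kbuf_op, sqe, -1, NULL, 0, 0);
	sqe->flags |= IOSQE_SQE_GROUP;
	if (link_next_group)
		sqe->flags |= IOSQE_IO_LINK; /* run the next group after this one */

	/* member: runs in parallel with the other member */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_send(sqe, sock1, NULL, len, 0);
	sqe->flags |= IOSQE_SQE_GROUP;

	/* last member: flag cleared, terminating the group */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_send(sqe, sock2, NULL, len, 0);
}

Calling this once per buffer, with link_next_group set for group 1 only,
yields the linked group 1 -> group 2 submission above in a single syscall.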
>
> Once local buffers are done, we can look at doing the sqe grouping in a
> nice way. I do think it's a potentially powerful concept, but we're
> going to make a lot more progress on this issue if we carefully separate
> dependencies and get each of them done separately.
One fundamental difference between local buffer and REQ_F_GROUP_KBUF is:

- a local buffer has to be provided and used in ->prep()
- a REQ_F_GROUP_KBUF buffer is provided in ->issue() rather than ->prep()

The only code the two could share is a buffer abstraction for OPs to
consume, and even that would be used differently (->prep() vs. ->issue()).
So it is hard to call this decoupling, especially since REQ_F_GROUP_KBUF is
already simple enough and the main change is importing the buffer in OP
code.
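
A rough kernel-side sketch of that difference, using an import helper
modeled on this series (the name and signature here are assumptions, not a
merged API): with REQ_F_GROUP_KBUF the buffer only exists once the group
leader's ->issue() has produced it, so the member OP has to import it at
issue time:

static int op_issue(struct io_kiocb *req, unsigned int issue_flags)
{
	struct iov_iter iter;
	int ret;

	/*
	 * ->prep() already ran; only now is the leader's leased buffer
	 * available. A ->prep()-provided local buffer, by contrast, must
	 * already be resolvable while the SQE is being prepared.
	 */
	ret = io_import_group_kbuf(req, 0 /* offset */, 4096 /* len */,
				   ITER_SOURCE, &iter);
	if (ret)
		return ret;

	/* ... drive the actual transfer against 'iter' ... */
	return 0;
}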
Local buffer is a smart idea, but I hope the following questions are
settled first:

1) Is it generic enough to only allow providing the local buffer during
->prep()?

- this way effectively becomes sync & nowait IO instead of AIO, which is a
  strong constraint from the UAPI viewpoint
- a driver may need to wait until some data arrives, and only then return
  and provide the buffer holding that data; local buffer can't cover this
  case

2) Is it allowed to call ->uring_cmd() from io_uring_cmd_prep()? If not,
how is io_uring supposed to call into the driver for leasing the kernel
buffer?

3) In OP code, how do we differentiate normal userspace buffer select from
local buffer? And how does an OP know whether normal buffer select or a
local kernel buffer should be used? Some OPs may want normal buffer select,
others the local buffer.

4) Arbitrary numbers of local buffers need to be supported, since IO often
comes in batches. It shouldn't be hard to support by adding an xarray to
the submission state (see the sketch after this list), but what do you
think of that added complexity? Without support for arbitrary numbers of
local buffers, performance can only be bad, which defeats the point of zero
copy. Meanwhile, as the number of local buffers grows, more rsrc_node & imu
allocations are introduced, which may degrade performance a bit too.

5) io_rsrc_node becomes part of the interface between io_uring and the
driver for releasing the leased buffer, so extra data has to be added to
`io_rsrc_node` for driver use.
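
For point 4, a minimal sketch of what I mean by an xarray in the submission
state, with the release hook from point 5 attached; struct io_local_kbuf
and both helpers are hypothetical names, only the xarray calls themselves
are stock kernel API:

#include <linux/xarray.h>

struct io_local_kbuf {
	struct io_rsrc_node	*node;		/* point 5: release path */
	void			*driver_priv;	/* extra data for the driver */
};

/* stash one leased buffer per index; batched IOs look it up by index */
static int io_add_local_kbuf(struct xarray *local_kbufs, unsigned int idx,
			     struct io_local_kbuf *kbuf)
{
	return xa_err(xa_store(local_kbufs, idx, kbuf, GFP_KERNEL));
}

static struct io_local_kbuf *io_find_local_kbuf(struct xarray *local_kbufs,
						unsigned int idx)
{
	return xa_load(local_kbufs, idx);
}

Each entry is one more allocation per leased buffer, which is exactly the
rsrc_node/imu overhead concern raised in point 4.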
From the above, at least the following can be concluded:

- it isn't generic enough (#1, #3)
- it still needs sqe group
- it is much more complicated than REQ_F_GROUP_KBUF alone
- it can't be more efficient
Thanks,
Ming