From: Ming Lei <[email protected]>
To: Jens Axboe <[email protected]>
Cc: Pavel Begunkov <[email protected]>,
[email protected], [email protected],
Uday Shankar <[email protected]>,
Akilesh Kailash <[email protected]>,
[email protected]
Subject: Re: [PATCH V8 0/8] io_uring: support sqe group and leased group kbuf
Date: Fri, 1 Nov 2024 09:39:27 +0800 [thread overview]
Message-ID: <ZyQxT9gp65WNajyL@fedora> (raw)
In-Reply-To: <[email protected]>
On Thu, Oct 31, 2024 at 07:35:35AM -0600, Jens Axboe wrote:
> On 10/30/24 8:53 PM, Ming Lei wrote:
> > On Wed, Oct 30, 2024 at 07:20:48AM -0600, Jens Axboe wrote:
> >> On 10/29/24 10:11 PM, Ming Lei wrote:
> >>> On Wed, Oct 30, 2024 at 11:08:16AM +0800, Ming Lei wrote:
> >>>> On Tue, Oct 29, 2024 at 08:43:39PM -0600, Jens Axboe wrote:
> >>>
> >>> ...
> >>>
> >>>>> You could avoid the OP dependency with just a flag, if you really wanted
> >>>>> to. But I'm not sure it makes a lot of sense. And it's a hell of a lot
> >>>>
> >>>> Yes, IO_LINK won't work for submitting multiple IOs concurrently, extra
> >>>> syscall makes application too complicated, and IO latency is increased.
> >>>>
> >>>>> simpler than the sqe group scheme, which I'm a bit worried about as it's
> >>>>> a bit complicated in how deep it needs to go in the code. This one
> >>>>> stands alone, so I'd strongly encourage we pursue this a bit further and
> >>>>> iron out the kinks. Maybe it won't work in the end, I don't know, but it
> >>>>> seems pretty promising and it's soooo much simpler.
> >>>>
> >>>> If buffer register and lookup are always done in ->prep(), OP dependency
> >>>> may be avoided.
> >>>
> >>> Even all buffer register and lookup are done in ->prep(), OP dependency
> >>> still can't be avoided completely, such as:
> >>>
> >>> 1) two local buffers for sending to two sockets
> >>>
> >>> 2) group 1: IORING_OP_LOCAL_KBUF1 & [send(sock1), send(sock2)]
> >>>
> >>> 3) group 2: IORING_OP_LOCAL_KBUF2 & [send(sock1), send(sock2)]
> >>>
> >>> group 1 and group 2 needs to be linked, but inside each group, the two
> >>> sends may be submitted in parallel.
> >>
> >> That is where groups of course work, in that you can submit 2 groups and
> >> have each member inside each group run independently. But I do think we
> >> need to decouple the local buffer and group concepts entirely. For the
> >> first step, getting local buffers working with zero copy would be ideal,
> >> and then just live with the fact that group 1 needs to be submitted
> >> first and group 2 once the first ones are done.
> >
> > IMHO, it is one _kernel_ zero copy(_performance_) feature, which often
> > imply:
> >
> > - better performance expectation
> > - no big change on existed application for using this feature
>
> For #2, really depends on what it is. But ideally, yes, agree.
>
> > Application developer is less interested in sort of crippled or
> > immature feature, especially need big change on existed code
> > logic(then two code paths need to maintain), with potential
> > performance regression.
> >
> > With sqe group and REQ_F_GROUP_KBUF, application just needs lines of
> > code change for using the feature, and it is pretty easy to evaluate
> > the feature since no any extra logic change & no extra syscall/wait
> > introduced. The whole patchset has been mature enough, unfortunately
> > blocked without obvious reasons.
>
> Let me tell you where I'm coming from. If you might recall, I originated
> the whole grouping idea. Didn't complete it, but it's essentially the
> same concept as REQ_F_GROUP_KBUF in that you have some dependents on a
> leader, and the dependents can run in parallel rather than being
> serialized by links. I'm obviously in favor of this concept, but I want
> to see it being done in such a way that it's actually something we can
> reason about and maintain. You want it for zero copy, which makes sense,
> but I also want to ensure it's a CLEAN implementation that doesn't have
> tangles in places it doesn't need to.
>
> You seem to be very hard to convince of making ANY changes at all. In
> your mind the whole thing is done, and it's being "blocked without
> obvious reason". It's not being blocked at all, I've been diligently
> trying to work with you recently on getting this done. I'm at least as
> interested as you in getting this work done. But I want you to work with
> me a bit on some items so we can get it into a shape where I'm happy
> with it, and I can maintain it going forward.
>
> So, please, rather than dig your heels in all the time, have an open
> mind on how we can accomplish some of these things.
>
>
> >> Once local buffers are done, we can look at doing the sqe grouping in a
> >> nice way. I do think it's a potentially powerful concept, but we're
> >> going to make a lot more progress on this issue if we carefully separate
> >> dependencies and get each of them done separately.
> >
> > One fundamental difference between local buffer and REQ_F_GROUP_KBUF is
> >
> > - local buffer has to be provided and used in ->prep()
> > - REQ_F_GROUP_KBUF needs to be provided in ->issue() instead of ->prep()
>
> It does not - the POC certainly did it in ->prep(), but all it really
> cares about is having the ring locked. ->prep() always has that,
> ->issue() _normally_ has that, unless it ends up in an io-wq punt.
>
> You can certainly do it in ->issue() and still have it be per-submit,
> the latter which I care about for safety reasons. This just means it has
> to be provided in the _first_ issue, and that IOSQE_ASYNC must not be
> set on the request. I think that restriction is fine, nobody should
> really be using IOSQE_ASYNC anyway.
Yes, IOSQE_ASYNC can't work, and IO_LINK can't work too.
The restriction on IO_LINK is too strong, because it needs big
application change.
>
> I think the original POC maybe did more harm than good in that it was
> too simplistic, and you seem too focused on the limits of that. So let
I am actually trying to thinking how to code local buffer, that is why I
puts these questions first.
> me detail what it actually could look like. We have io_submit_state in
> io_ring_ctx. This is per-submit private data, it's initialized and
> flushed for each io_uring_enter(2) that submits requests.
>
> We have a registered file and buffer table, file_table and buf_table.
> These have life times that are dependent on the ring and
> registration/unregistration. We could have a local_table. This one
> should be setup by some register command, eg reserving X slots for that.
> At the end of submit, we'd flush this table, putting nodes in there.
> Requests can look at the table in either prep or issue, and find buffer
> nodes. If a request uses one of these, it grabs a ref and hence has it
> available until it puts it at IO completion time. When a single submit
> context is done, local_table is iterated (if any entries exist) and
> existing nodes cleared and put.
>
> That provides a similar table to buf_table, but with a lifetime of a
> submit. Just like local buf. Yes it would not be private to a single
> group, it'd be private to a submit which has potentially bigger scope,
> but that should not matter.
>
> That should give you exactly what you need, if you use
> IORING_RSRC_KBUFFER in the local_table. But it could even be used for
> IORING_RSRC_BUFFER as well, providing buffers for a single submit cycle
> as well.
>
> Rather than do something completely on the side with
> io_uring_kernel_buf, we can use io_rsrc_node and io_mapped_ubuf for
> this. Which goes back to my initial rant in this email - use EXISTING
> infrastructure for these things. A big part of why this isn't making
> progress is that a lot of things are done on the side rather than being
> integrated. Then you need extra io_kiocb members, where it really should
> just be using io_rsrc_node and get everything else for free. No need to
> do special checking and putting seperately, it's a resource node just
> like any other resource node we already support.
>
> > The only common code could be one buffer abstraction for OP to use, but
> > still used differently, ->prep() vs. ->issue().
>
> With the prep vs issue thing not being an issue, then it sounds like we
IO_LINK is another blocker for prep vs issue thing.
> fully agree that a) it should be one buffer abstraction, and b) we
Yes, can't agree more.
> already have the infrastructure for this. We just need to add
> IORING_RSRC_KBUFFER, which I already posted some POC code for.
I am open to IORING_RSRC_KBUFFER if there isn't too strong limit for
application. Disallowing IOSQE_ASYNC is fine, but not allowing IO_LINK
does cause trouble for application, and need big change on app code.
>
> > So it is hard to call it decouple, especially REQ_F_GROUP_KBUF has been
> > simple enough, and the main change is to import it in OP code.
> >
> > Local buffer is one smart idea, but I hope the following things may be
> > settled first:
> >
> > 1) is it generic enough to just allow to provide local buffer during
> > ->prep()?
> >
> > - this way actually becomes sync & nowait IO, instead AIO, and has been
> > one strong constraint from UAPI viewpoint.
> >
> > - Driver may need to wait until some data comes, then return & provide
> > the buffer with data, and local buffer can't cover this case
>
> This should be moot now with the above explanation.
>
> > 2) is it allowed to call ->uring_cmd() from io_uring_cmd_prep()? If not,
> > any idea to call into driver for leasing the kernel buffer to io_uring?
>
> Ditto
>
> > 3) in OP code, how to differentiate normal userspace buffer select with
> > local buffer? And how does OP know if normal buffer select or local
> > kernel buffer should be used? Some OP may want to use normal buffer
> > select instead of local buffer, others may want to use local buffer.
>
> Yes this is a key question we need to figure out. Right now using fixed
> buffers needs to set ->buf_index, and the OP needs to know aboout it.
> let's not confuse it with buffer select, IOSQE_BUFFER_SELECT, as that's
> for provided buffers.
Indeed, here what matters is fixed buffer.
>
> > 4) arbitrary numbers of local buffer needs to be supported, since IO
> > often comes at batch, it shouldn't be hard to support it by adding xarray
> > to submission state, what do you think of this added complexity? Without
> > supporting arbitrary number of local buffers, performance can be just
> > bad, it doesn't make sense as zc viewpoint. Meantime as number of local
> > buffer is increased, more rsrc_node & imu allocation is introduced, this
> > still may degrade perf a bit.
>
> That's fine, we just need to reserve space for them upfront. I don't
> like the xarray idea, as:
>
> 1) xarray does internal locking, which we don't need here
> 2) The existing io_rsrc_data table is what is being used for
> io_rsrc_node management now. This would introduce another method for
> that.
>
> I do want to ensure that io_submit_state_finish() is still low overhead,
> and using an xarray would be more expensive than just doing:
>
> if (ctx->local_table.nr)
> flush_nodes();
>
> as you'd need to always setup an iterator. But this isn't really THAT
> important. The benefit of using an xarray would be that we'd get
> flexible storing of members without needing pre-registration, obviously.
The main trouble could be buffer table pre-allocation if xarray isn't
used.
>
> > 5) io_rsrc_node becomes part of interface between io_uring and driver
> > for releasing the leased buffer, so extra data has to be
> > added to `io_rsrc_node` for driver use.
>
> That's fine imho. The KBUFFER addition already adds the callback, we can
> add a data thing too. The kernel you based your code on has an
> io_rsrc_node that is 48 bytes in size, and my current tree has one where
> it's 32 bytes in size after the table rework. If we have to add 2x8b to
> support this, that's NOT a big deal and we just end up with a node
> that's the same size as before.
>
> And we get rid of this odd intermediate io_uring_kernel_buf struct,
> which is WAY bigger anyway, and requires TWO allocations where the
> existing io_mapped_ubuf embeds the bvec. I'd argue two vs one allocs is
> a much bigger deal for performance reasons.
The structure is actually preallocated in ublk, so it is zero allocation
in my patchset in case of non bio merge.
>
> As a final note, one thing I mentioned in an earlier email is that part
> of the issue here is that there are several things that need ironing
> out, and they are actually totally separate. One is the buffer side,
> which this email mostly deals with, the other one is the grouping
> concept.
>
> For the sqe grouping, one sticking point has been using that last
> sqe->flags bit. I was thinking about this last night, and what if we got
> away from using a flag entirely? At some point io_uring needs to deal
> with this flag limitation, but it's arguably a large effort, and I'd
> greatly prefer not having to paper over it to shove in grouped SQEs.
>
> So... what if we simply added a new OP, IORING_OP_GROUP_START, or
> something like that. Hence instead of having a designated group leader
> bit for an OP, eg:
>
> sqe = io_uring_get_sqe(ring);
> io_uring_prep_read(sqe, ...);
> sqe->flags |= IOSQE_GROUP_BIT;
>
> you'd do:
>
> sqe = io_uring_get_sqe(ring);
> io_uring_prep_group_start(sqe, ...);
> sqe->flags |= IOSQE_IO_LINK;
One problem is how to map IORING_OP_GROUP_START to uring_cmd.
IOSQE_IO_LINK isn't another trouble, because it becomes not possible
to model dependency among groups.
>
> sqe = io_uring_get_sqe(ring);
> io_uring_prep_read(sqe, ...);
>
> which would be the equivalent transformation - the read would be the
> group leader as it's the first member of that chain. The read should set
> IOSQE_IO_LINK for as long as it has members. The members in that group
> would NOT be serialized. They would use IOSQE_IO_LINK purely to be part
> of that group, but IOSQE_IO_LINK would not cause them to be serialized.
> Hence the link just implies membership, not ordering within the group.
>
> This removes the flag issue, with the sligth caveat that IOSQE_IO_LINK
> has a different meaning inside the group. Maybe you'd need a GROUP_END
No, for group leader IOSQE_IO_LINK works same as any other normal SQE.
Thanks,
Ming
next prev parent reply other threads:[~2024-11-01 1:39 UTC|newest]
Thread overview: 42+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-25 12:22 [PATCH V8 0/8] io_uring: support sqe group and leased group kbuf Ming Lei
2024-10-25 12:22 ` [PATCH V8 1/7] io_uring: add io_link_req() helper Ming Lei
2024-10-25 12:22 ` [PATCH V8 2/7] io_uring: add io_submit_fail_link() helper Ming Lei
2024-10-25 12:22 ` [PATCH V8 3/7] io_uring: add helper of io_req_commit_cqe() Ming Lei
2024-10-25 12:22 ` [PATCH V8 4/7] io_uring: support SQE group Ming Lei
2024-10-29 0:12 ` Jens Axboe
2024-10-29 1:50 ` Ming Lei
2024-10-29 16:38 ` Pavel Begunkov
2024-10-31 21:24 ` Jens Axboe
2024-10-31 21:39 ` Jens Axboe
2024-11-01 0:00 ` Jens Axboe
2024-10-25 12:22 ` [PATCH V8 5/7] io_uring: support leased group buffer with REQ_F_GROUP_KBUF Ming Lei
2024-10-29 16:47 ` Pavel Begunkov
2024-10-30 0:45 ` Ming Lei
2024-10-30 1:25 ` Pavel Begunkov
2024-10-30 2:04 ` Ming Lei
2024-10-31 13:16 ` Pavel Begunkov
2024-11-01 1:04 ` Ming Lei
2024-10-25 12:22 ` [PATCH V8 6/7] io_uring/uring_cmd: support leasing device kernel buffer to io_uring Ming Lei
2024-10-25 12:22 ` [PATCH V8 7/7] ublk: support leasing io " Ming Lei
2024-10-29 17:01 ` [PATCH V8 0/8] io_uring: support sqe group and leased group kbuf Pavel Begunkov
2024-10-29 17:04 ` Jens Axboe
2024-10-29 19:18 ` Jens Axboe
2024-10-29 20:06 ` Jens Axboe
2024-10-29 21:26 ` Jens Axboe
2024-10-30 2:03 ` Ming Lei
2024-10-30 2:43 ` Jens Axboe
2024-10-30 3:08 ` Ming Lei
2024-10-30 4:11 ` Ming Lei
2024-10-30 13:20 ` Jens Axboe
2024-10-31 2:53 ` Ming Lei
2024-10-31 13:35 ` Jens Axboe
2024-10-31 15:07 ` Jens Axboe
2024-11-01 2:57 ` Ming Lei
2024-11-01 1:39 ` Ming Lei [this message]
2024-10-31 13:42 ` Pavel Begunkov
2024-10-30 13:18 ` Jens Axboe
2024-10-31 13:25 ` Pavel Begunkov
2024-10-31 14:29 ` Jens Axboe
2024-10-31 15:25 ` Pavel Begunkov
2024-10-31 15:42 ` Jens Axboe
2024-10-31 16:29 ` Pavel Begunkov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZyQxT9gp65WNajyL@fedora \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox