public inbox for [email protected]
 help / color / mirror / Atom feed
From: Ming Lei <[email protected]>
To: Pavel Begunkov <[email protected]>
Cc: Jens Axboe <[email protected]>,
	[email protected], [email protected],
	[email protected]
Subject: Re: [PATCH V6 6/8] io_uring: support providing sqe group buffer
Date: Fri, 11 Oct 2024 12:06:18 +0800	[thread overview]
Message-ID: <ZwikOt5cs4TtF5NJ@fedora> (raw)
In-Reply-To: <ZwiGpp4ePoCihohg@fedora>

On Fri, Oct 11, 2024 at 10:00:06AM +0800, Ming Lei wrote:
> On Thu, Oct 10, 2024 at 07:51:19PM +0100, Pavel Begunkov wrote:
> > On 10/10/24 04:00, Ming Lei wrote:
> > > On Wed, Oct 09, 2024 at 03:25:25PM +0100, Pavel Begunkov wrote:
> > > > On 10/6/24 09:20, Ming Lei wrote:
> > > > > On Fri, Oct 04, 2024 at 04:32:04PM +0100, Pavel Begunkov wrote:
> > > > > > On 9/12/24 11:49, Ming Lei wrote:
> > > > > > ...
> > > > > > > It can help to implement generic zero copy between device and related
> > > > > > > operations, such as ublk, fuse, vdpa,
> > > > > > > even network receive or whatever.
> > > > > > 
> > > > > > That's exactly the thing it can't sanely work with because
> > > > > > of this design.
> > > > > 
> > > > > The provide buffer design is absolutely generic, and basically
> > > > > 
> > > > > - group leader provides buffer for member OPs, and member OPs can borrow
> > > > > the buffer if leader allows by calling io_provide_group_kbuf()
> > > > > 
> > > > > - after member OPs consumes the buffer, the buffer is returned back by
> > > > > the callback implemented in group leader subsystem, so group leader can
> > > > > release related sources;
> > > > > 
> > > > > - and it is guaranteed that the buffer can be released always
> > > > > 
> > > > > The ublk implementation is pretty simple, it can be reused in device driver
> > > > > to share buffer with other kernel subsystems.
> > > > > 
> > > > > I don't see anything insane with the design.
> > > > 
> > > > There is nothing insane with the design, but the problem is cross
> > > > request error handling, same thing that makes links a pain to use.
> > > 
> > > Wrt. link, the whole group is linked in the chain, and it respects
> > > all existed link rule, care to share the trouble in link use case?
> > 
> > Error handling is a pain, it has been, even for pure link without
> > any groups. Even with a simple req1 -> req2, you need to track if
> > the first request fails you need to expect another failed CQE for
> > the second request, probably refcount (let's say non-atomically)
> > some structure and clean it up when you get both CQEs. It's not
> > prettier when the 2nd fails, especially if you consider short IO
> > and that you can't fully retry that partial IO, e.g. you consumed
> > data from the socket. And so on.
> > 
> > > The only thing I thought of is that group internal link isn't supported
> > > yet, but it may be added in future if use case is coming.
> > > 
> > > > It's good that with storage reads are reasonably idempotent and you
> > > > can be retried if needed. With sockets and streams, however, you
> > > > can't sanely borrow a buffer without consuming it, so if a member
> > > > request processing the buffer fails for any reason, the user data
> > > > will be dropped on the floor. I mentioned quite a while before,
> > > > if for example you stash the buffer somewhere you can access
> > > > across syscalls like the io_uring's registered buffer table, the
> > > > user at least would be able to find an error and then memcpy the
> > > > unprocessed data as a fallback.
> > > 
> > > I guess it is net rx case, which requires buffer to cross syscalls,
> > > then the buffer has to be owned by userspace, otherwise the buffer
> > > can be leaked easily.
> > > 
> > > That may not match with sqe group which is supposed to borrow kernel
> > > buffer consumed by users.
> > 
> > It doesn't necessarily require to keep buffers across syscalls
> > per se, it just can't drop the data it got on the floor. It's
> > just storage can read data again.
> 
> In case of short read, data is really stored(not dropped) in the provided
> buffer, and you can consume the short read data, or continue to read more to
> the same buffer.
> 
> What is the your real issue here?
> 
> > 
> > ...
> > > > > > > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> > > > > > > index 793d5a26d9b8..445e5507565a 100644
> > > > > > > --- a/include/linux/io_uring_types.h
> > > > > > > +++ b/include/linux/io_uring_types.h
> > ...
> > > > > > FWIW, would be nice if during init figure we can verify that the leader
> > > > > > provides a buffer IFF there is someone consuming it, but I don't think
> > > > > 
> > > > > It isn't doable, same reason with IORING_OP_PROVIDE_BUFFERS, since buffer can
> > > > > only be provided in ->issue().
> > > > 
> > > > In theory we could, in practise it'd be too much of a pain, I agree.
> > > > 
> > > > IORING_OP_PROVIDE_BUFFERS is different as you just stash the buffer
> > > > in the io_uring instance, and it's used at an unspecified time later
> > > > by some request. In this sense the API is explicit, requests that don't
> > > > support it but marked with IOSQE_BUFFER_SELECT will be failed by the
> > > > kernel.
> > > 
> > > That is also one reason why I add ->accept_group_kbuf.
> > 
> > I probably missed that, but I haven't seen that
> 
> Such as, any OPs with fixed buffer can't set ->accept_group_kbuf.
> 
> > 
> > > > > > the semantics is flexible enough to do it sanely. i.e. there are many
> > > > > > members in a group, some might want to use the buffer and some might not.
> > > > > > 
> > ...
> > > > > > > +	if (!kbuf->bvec)
> > > > > > > +		return -EINVAL;
> > > > > > 
> > > > > > How can this happen?
> > > > > 
> > > > > OK, we can run the check in uring_cmd API.
> > > > 
> > > > Not sure I follow, if a request providing a buffer can't set
> > > > a bvec it should just fail, without exposing half made
> > > > io_uring_kernel_buf to other requests.
> > > > 
> > > > Is it rather a WARN_ON_ONCE check?
> > > 
> > > I meant we can check it in API of io_provide_group_kbuf() since the group
> > > buffer is filled by driver, since then the buffer is immutable, and we
> > > needn't any other check.
> > 
> > That's be a buggy provider, so sounds like WARN_ON_ONCE
> 
> Not at all.
> 
> If the driver provides bad buffer, all group leader and members OP will be
> failed, and userspace can get notified.
> 
> > 
> > ...
> > > > > > >     		if (unlikely(ret < 0))
> > > > > > > @@ -593,6 +600,15 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
> > > > > > >     	if (issue_flags & IO_URING_F_NONBLOCK)
> > > > > > >     		flags |= MSG_DONTWAIT;
> > > > > > > +	if (req->flags & REQ_F_GROUP_KBUF) {
> > > > > > 
> > > > > > Does anything prevent the request to be marked by both
> > > > > > GROUP_KBUF and BUFFER_SELECT? In which case we'd set up
> > > > > > a group kbuf and then go to the io_do_buffer_select()
> > > > > > overriding all of that
> > > > > 
> > > > > It could be used in this way, and we can fail the member in
> > > > > io_queue_group_members().
> > > > 
> > > > That's where the opdef flag could actually be useful,
> > > > 
> > > > if (opdef[member]->accept_group_kbuf &&
> > > >      member->flags & SELECT_BUF)
> > > > 	fail;
> > > > 
> > > > 
> > > > > > > +		ret = io_import_group_kbuf(req,
> > > > > > > +					user_ptr_to_u64(sr->buf),
> > > > > > > +					sr->len, ITER_SOURCE,
> > > > > > > +					&kmsg->msg.msg_iter);
> > > > > > > +		if (unlikely(ret))
> > > > > > > +			return ret;
> > > > > > > +	}
> > > > > > > +
> > > > > > >     retry_bundle:
> > > > > > >     	if (io_do_buffer_select(req)) {
> > > > > > >     		struct buf_sel_arg arg = {
> > > > > > > @@ -1154,6 +1170,11 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
> > > > > > >     			goto out_free;
> > > > > > >     		}
> > > > > > >     		sr->buf = NULL;
> > > > > > > +	} else if (req->flags & REQ_F_GROUP_KBUF) {
> > > > > > 
> > > > > > What happens if we get a short read/recv?
> > > > > 
> > > > > For short read/recv, any progress is stored in iterator, nothing to do
> > > > > with the provide buffer, which is immutable.
> > > > > 
> > > > > One problem for read is reissue, but it can be handled by saving iter
> > > > > state after the group buffer is imported, I will fix it in next version.
> > > > > For net recv, offset/len of buffer is updated in case of short recv, so
> > > > > it works as expected.
> > > > 
> > > > That was one of my worries.
> > > > 
> > > > > Or any other issue with short read/recv? Can you explain in detail?
> > > > 
> > > > To sum up design wise, when members that are using the buffer as a
> > > > source, e.g. write/send, fail, the user is expected to usually reissue
> > > > both the write and the ublk cmd.
> > > > 
> > > > Let's say you have a ublk leader command providing a 4K buffer, and
> > > > you group it with a 4K send using the buffer. Let's assume the send
> > > > is short and does't only 2K of data. Then the user would normally
> > > > reissue:
> > > > 
> > > > ublk(4K, GROUP), send(off=2K)
> > > > 
> > > > That's fine assuming short IO is rare.
> > > > 
> > > > I worry more about the backward flow, ublk provides an "empty" buffer
> > > > to receive/read into. ublk wants to do something with the buffer in
> > > > the callback. What happens when read/recv is short (and cannot be
> > > > retried by io_uring)?
> > > > 
> > > > 1. ublk(provide empty 4K buffer)
> > > > 2. recv, ret=2K
> > > > 3. ->grp_kbuf_ack: ublk should commit back only 2K
> > > >     of data and not assume it's 4K
> > > 
> > > ->grp_kbuf_ack is supposed to only return back the buffer to the
> > > owner, and it doesn't care result of buffer consumption.
> > > 
> > > When ->grp_kbuf_ack() is done, it means this time buffer borrow is
> > > over.
> > > 
> > > When userspace figures out it is one short read, it will send one
> > > ublk uring_cmd to notify that this io command is completed with
> > > result(2k). ublk driver may decide to requeue this io command for
> > > retrying the remained bytes, when only remained part of the buffer
> > > is allowed to borrow in following provide uring command originated
> > > from userspace.
> > 
> > My apologies, I failed to notice that moment, even though should've
> > given it some thinking at the very beginning. I think that part would
> > be a terrible interface. Might be good enough for ublk, but we can't
> > be creating a ublk specific features that change the entire io_uring.
> > Without knowing how much data it actually got, in generic case you
> 
> You do know how much data actually got in the member OP, don't you?
> 
> > 1) need to require the buffer to be fully initialised / zeroed
> > before handing it.
> 
> The buffer is really initialized before being provided via
> io_provide_group_kbuf(). And it is one bvec buffer, anytime the part
> is consumed, the iterator is advanced, so always initialized buffer
> is provided to consumer OP.
> 
> > 2) Can't ever commit the data from the callback,
> 
> What do you mean `commit`?
> 
> The callback is documented clearly from beginning that it is for
> returning back the buffer to the owner.
> 
> Only member OPs consume buffer, and group leader provides valid buffer
> for member OP, and the buffer lifetime is aligned with group leader
> request.
> 
> > but it would need to wait until the userspace reacts. Yes, it
> > works in the specific context of ublk, but I don't think it works
> > as a generic interface.
> 
> It is just how ublk uses group buffer, but not necessary to be exactly
> this way.
> 
> Anytime the buffer is provided via io_provide_group_kbuf() successfully,
> the member OPs can consume it safely, and finally the buffer is returned
> back if all member OPs are completed. That is all.

Forget to mention:

The same buffer can be provided multiple times if it is valid, and one
offset can be added(not done yet in this patchset) easily on the provide
buffer uring_command, so the buffer can be advanced in case of short recv
in the provider side.

> Please explain why it isn't generic interface.


Thanks,
Ming


  reply	other threads:[~2024-10-11  4:06 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-12 10:49 [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf Ming Lei
2024-09-12 10:49 ` [PATCH V6 1/8] io_uring: add io_link_req() helper Ming Lei
2024-09-12 10:49 ` [PATCH V6 2/8] io_uring: add io_submit_fail_link() helper Ming Lei
2024-09-12 10:49 ` [PATCH V6 3/8] io_uring: add helper of io_req_commit_cqe() Ming Lei
2024-09-12 10:49 ` [PATCH V6 4/8] io_uring: support SQE group Ming Lei
2024-10-04 13:12   ` Pavel Begunkov
2024-10-06  3:54     ` Ming Lei
2024-10-09 11:53       ` Pavel Begunkov
2024-10-09 12:14         ` Ming Lei
2024-09-12 10:49 ` [PATCH V6 5/8] io_uring: support sqe group with members depending on leader Ming Lei
2024-10-04 13:18   ` Pavel Begunkov
2024-10-06  3:54     ` Ming Lei
2024-09-12 10:49 ` [PATCH V6 6/8] io_uring: support providing sqe group buffer Ming Lei
2024-10-04 15:32   ` Pavel Begunkov
2024-10-06  8:20     ` Ming Lei
2024-10-09 14:25       ` Pavel Begunkov
2024-10-10  3:00         ` Ming Lei
2024-10-10 18:51           ` Pavel Begunkov
2024-10-11  2:00             ` Ming Lei
2024-10-11  4:06               ` Ming Lei [this message]
2024-10-06  9:47     ` Ming Lei
2024-10-09 11:57       ` Pavel Begunkov
2024-10-09 12:21         ` Ming Lei
2024-09-12 10:49 ` [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer Ming Lei
2024-10-04 15:44   ` Pavel Begunkov
2024-10-06  8:46     ` Ming Lei
2024-10-09 15:14       ` Pavel Begunkov
2024-10-10  3:28         ` Ming Lei
2024-10-10 15:48           ` Pavel Begunkov
2024-10-10 19:31             ` Jens Axboe
2024-10-11  2:30               ` Ming Lei
2024-10-11  2:39                 ` Jens Axboe
2024-10-11  3:07                   ` Ming Lei
2024-10-11 13:24                     ` Jens Axboe
2024-10-11 14:20                       ` Ming Lei
2024-10-11 14:41                         ` Jens Axboe
2024-10-11 15:45                           ` Ming Lei
2024-10-11 16:49                             ` Jens Axboe
2024-10-12  3:35                               ` Ming Lei
2024-10-14 18:40                             ` Pavel Begunkov
2024-10-15 11:05                               ` Ming Lei
2024-09-12 10:49 ` [PATCH V6 8/8] ublk: support provide io buffer Ming Lei
2024-10-17 22:31   ` Uday Shankar
2024-10-18  0:45     ` Ming Lei
2024-09-26 10:27 ` [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf Ming Lei
2024-09-26 12:18   ` Jens Axboe
2024-09-26 19:46     ` Pavel Begunkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZwikOt5cs4TtF5NJ@fedora \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox