From: Ming Lei <[email protected]>
To: Pavel Begunkov <[email protected]>
Cc: Jens Axboe <[email protected]>,
[email protected], [email protected],
[email protected]
Subject: Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
Date: Tue, 15 Oct 2024 19:05:35 +0800
Message-ID: <Zw5Mf3pe06UOYLFW@fedora>
In-Reply-To: <[email protected]>
On Mon, Oct 14, 2024 at 07:40:40PM +0100, Pavel Begunkov wrote:
> On 10/11/24 16:45, Ming Lei wrote:
> > On Fri, Oct 11, 2024 at 08:41:03AM -0600, Jens Axboe wrote:
> > > On 10/11/24 8:20 AM, Ming Lei wrote:
> > > > On Fri, Oct 11, 2024 at 07:24:27AM -0600, Jens Axboe wrote:
> > > > > On 10/10/24 9:07 PM, Ming Lei wrote:
> > > > > > On Thu, Oct 10, 2024 at 08:39:12PM -0600, Jens Axboe wrote:
> > > > > > > On 10/10/24 8:30 PM, Ming Lei wrote:
> > > > > > > > Hi Jens,
> ...
> > > > > > Suppose we have N consumers OPs which depends on OP_BUF_UPDATE.
> > > > > >
> > > > > > 1) all N OPs are linked with OP_BUF_UPDATE
> > > > > >
> > > > > > Or
> > > > > >
> > > > > > 2) submit OP_BUF_UPDATE first, wait for its completion, then submit the
> > > > > > N OPs concurrently.
> > > > >
> > > > > Correct
> > > > >
> > > > > > But 1) and 2) may slow the IO handling. In 1) all N OPs are serialized,
> > > > > > and 1 extra syscall is introduced in 2).
> > > > >
> > > > > Yes, you don't want to do #1. But the OP_BUF_UPDATE is cheap enough that
> > > > > you can just do it upfront. It's not ideal in terms of usage, and I get
> > > > > where the grouping comes from. But is it possible to do the grouping in
> > > > > a less intrusive fashion with OP_BUF_UPDATE? Because it won't change any
> > > >
> > > > Most of the 'intrusive' change is just in patch 4, and Pavel has
> > > > commented that it is good enough:
> > > >
> > > > https://lore.kernel.org/linux-block/ZwZzsPcXyazyeZnu@fedora/T/#m551e94f080b80ccbd2561e01da5ea8e17f7ee15d
>
> Trying to catch up on the thread. I do think the patch is tolerable and
> mergeable, but I do think it adds quite a bit of complication to the path
> if you try to map out what states a request can be in and what
I admit that sqe group adds some complexity to the submission &
completion code, especially on the completion side.

But with your help, patch 4 has become easy to follow, and sqe group is
well-defined now. It also adds the new feature of N:M dependency;
without it, one extra syscall would be required to support N:M
dependency. This way not only saves one syscall but also simplifies the
application (see the submission sketch near the end of this mail).
> dependencies are there, and then the patches after it have to go into
> each and every io_uring opcode and add support for leased buffers. And
> I'm afraid
Only fast IO (net, fs) needs it; I don't see other OPs needing such support.
> that we'll also need feedback from completion of those to let
> the buffer know what ranges now have data / are initialised. One
> typical problem for page flipping rx, for example, is that you need
> to have a full page of data to map it, otherwise it should be
> prezeroed, which is too expensive; the same problem you can have
> without mmap'ing and directly exposing pages to the user.
In the current design, the callback is only for returning the leased
buffer to its owner, and we just need io_uring to do the driver a favor
by running aio with the leased buffer.

It can become quite complicated if we add feedback from completion.
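
For reference, the lease contract itself is small. A sketch, with
grp_kbuf_ack() taken from the kbuf.h hunk below, and the other field
names illustrative rather than the exact patch definitions:

/* Sketch of the leased-buffer contract. The provider (e.g. ublk)
 * fills this in; io_uring's only obligation is to invoke
 * grp_kbuf_ack() once all group members are done with the buffer.
 */
struct io_uring_kernel_buf {
	const struct bio_vec	*bvec;	/* provider's request/bio pages */
	unsigned int		nr_bvecs;
	unsigned long		len;
	/* return the buffer to its owner; no completion feedback */
	void (*grp_kbuf_ack)(const struct io_uring_kernel_buf *);
};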
Your catch on short read/recv is good: it may leak kernel data. The
problem exists with any other approach (provide kbuf) too; the point is
that it is a kernel buffer. What do you think of the following approach
of zeroing the remainder on completion?
diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
index d72a6bbbbd12..c1bc4179b390 100644
--- a/io_uring/kbuf.h
+++ b/io_uring/kbuf.h
@@ -242,4 +242,13 @@ static inline void io_drop_leased_grp_kbuf(struct io_kiocb *req)
if (gbuf)
gbuf->grp_kbuf_ack(gbuf);
}
+
+/* zero remaining bytes of the kernel buffer to avoid leaking data */
+static inline void io_req_zero_remained(struct io_kiocb *req, struct iov_iter *iter)
+{
+ size_t left = iov_iter_count(iter);
+
+ if (iov_iter_rw(iter) == READ && left > 0)
+ iov_iter_zero(left, iter);
+}
#endif
diff --git a/io_uring/net.c b/io_uring/net.c
index 6c32be92646f..022d81b6fc65 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -899,6 +899,8 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
*ret = IOU_STOP_MULTISHOT;
else
*ret = IOU_OK;
+ if (io_use_leased_grp_kbuf(req))
+ io_req_zero_remained(req, &kmsg->msg.msg_iter);
io_req_msg_cleanup(req, issue_flags);
return true;
}
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 76a443fa593c..565b0e742ee5 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -479,6 +479,11 @@ static bool __io_complete_rw_common(struct io_kiocb *req, long res)
}
req_set_fail(req);
req->cqe.res = res;
+ if (io_use_leased_grp_kbuf(req)) {
+ struct io_async_rw *io = req->async_data;
+
+ io_req_zero_remained(req, &io->iter);
+ }
}
return false;
}
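
To be clear on the intent of the hunks above: the zeroing runs on the
completion side, only for requests consuming a leased group kbuf and
only for the READ direction, so any bytes that a short read/recv left
uninitialised are cleared before the buffer goes back to its owner.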
>
> > > At least for me, patch 4 looks fine. The problem occurs when you start
> > > needing to support this different buffer type, which is in patch 6. I'm
> > > not saying we can necessarily solve this with OP_BUF_UPDATE, I just want
> > > to explore that path because if we can, then patch 6 turns into "oh
> > > let's just add registered/fixed buffer support to these ops that don't
> > > currently support it". And that would be much nicer indeed.
> ...
> > > > > would be totally fine in terms of performance. OP_BUF_UPDATE will
> > > > > _always_ complete immediately and inline, which means that it'll
> > > > > _always_ be immediately available post submission. The only thing you'd
> > > > > ever have to worry about in terms of failure is a badly formed request,
> > > > > which is a programming issue, or running out of memory on the host.
> > > > >
> > > > > > Also it makes error handling more complicated: io_uring has to remove
> > > > > > the kernel buffer when the current task exits, so a dependency or
> > > > > > ordering with the buffer provider is introduced.
> > > > >
> > > > > Why would that be? They belong to the ring, so should be torn down as
> > > > > part of the ring anyway? Why would they be task-private, but not
> > > > > ring-private?
> > > >
> > > > It is a kernel buffer, which belongs to the provider (such as ublk)
> > > > instead of the ring; the application may panic at any time, and then
> > > > io_uring has to remove the buffer to notify the buffer owner.
> > >
> > > But it could be an application buffer, no? You'd just need the
> > > application to provide it to ublk and have it mapped, rather than have
> > > ublk allocate it in-kernel and then use that.
> >
> > The buffer actually consists of the kernel 'request/bio' pages of
> > /dev/ublkbN, and now we forward and lend it to io_uring OPs (fs rw,
> > net send/recv), so it can't be an application buffer, unlike net rx.
>
> I don't see any problem in dropping buffers from the table
> on exit, we have a lot of stuff a thread does for io_uring
> when it exits.
io_uring cancel handling is complicated enough already; uring command
would have two cancel code paths if 'provide kernel buffer' were added:

1) io_uring_try_cancel_uring_cmd()

2) the kernel buffer cancel code path

There might be a dependency between the two.
>
>
> > > > In concept grouping is simpler because:
> > > >
> > > > - buffer lifetime is aligned with the group leader's lifetime, so we
> > > > needn't worry about buffer leaks caused by accidental application exit
> > >
> > > But if it was an application buffer, that would not be a concern.
> >
> > Yeah, but storage isn't the same as network; here an application
> > buffer can't support zc.
>
> Maybe I missed how it came to app buffers, but the thing I
> initially mentioned is about storing the kernel buffer in
> the table, without any user pointers and user buffers.
Yeah, just some random words, please ignore them.
>
> > > > - the buffer is lent to consumer OPs, and returned after all
> > > > consumers are done; this way avoids any dependency
> > > >
> > > > Meantime OP_BUF_UPDATE(provide buffer OP, remove buffer OP) becomes more
> > > > complicated:
> > > >
> > > > - buffer leak because of app panic
>
> Then io_uring dies and releases buffers. Or we can even add
> some code removing them; as mentioned, any task that has ever
> submitted a request already runs some io_uring code on exit.
>
> > > > - buffer dependency issue: consumer OPs depend on provide buffer OP,
> > > > remove buffer OP depends on consumer OPs; two syscalls have to be
> > > > added for handling a single ublk IO.
> > >
> > > Seems like most of this is because of the kernel buffer too, no?
> >
> > Yeah.
> >
> > >
> > > I do like the concept of the ephemeral buffer, the downside is that we
> > > need per-op support for it too. And while I'm not totally against doing
> >
> > Can you explain per-op support a bit?
> >
> > Now the buffer has been provided by one single uring command.
> >
> > > that, it would be lovely if we could utilize an existing mechanism for
> > > that rather than add another one.
>
> That would also be more flexible as not everything can be
> handled by linked request logic, and wouldn't require hacking
> into each and every request type to support "consuming" leased
> buffers.
I guess by 'consuming' you mean the code added in net.c and rw.c, which
can't be avoided, because it is a kernel buffer and we are supporting
that for the first time (a sketch follows this list):

- there isn't a userspace address, unlike buffer select & fixed buffer

- the kernel buffer has to be returned to the provider

- the buffer has to be imported in ->issue(); it can't be done in ->prep()

- short read/recv has to be dealt with
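
To make the import point concrete, a minimal sketch of the consumer
side, assuming a helper in the spirit of this series' group-kbuf
import; io_import_group_kbuf() and its signature are illustrative
stand-ins here, not the exact patch code:

/* Sketch only: import the leased group kernel buffer at ->issue()
 * time. There is no user address to validate in ->prep(); the iter is
 * built directly over the provider's bvec pages.
 */
static int io_rw_import_leased(struct io_kiocb *req, unsigned int len,
			       int dir, struct iov_iter *iter)
{
	/* must run in ->issue(): the leader has leased the buffer by now */
	if (!io_use_leased_grp_kbuf(req))
		return -EINVAL;

	/* hypothetical helper standing in for the series' import path */
	return io_import_group_kbuf(req, 0, len, dir, iter);
}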
>
> Overhead wise, let's say we fix buffer binding order and delay it
> as elaborated on below, then you can provide a buffer and link a
> consumer (e.g. send request or anything else) just as you do
> it now. You can also link a request returning the buffer to the
> same chain if you don't need extra flexibility.
>
> As for groups, they're complicated because of the order inversion,
IMO, the group complication only exists on the completion side, and
fortunately it is well-defined now.
Providing the buffer via the table makes the application more
complicated, with worse performance:

- two syscalls (uring_enter trips) are added for each ublk IO

- one extra request is added (a group needs 2 requests, while add-buffer
needs 3 requests for the simplest case), hence bigger SQ & CQ sizes

- extra cancel handling
The group approach simplifies buffer lifetime a lot, since io_uring
needn't care about it at all; a rough cost comparison follows.
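
Roughly, per ublk IO for the simplest case, as I count it:

                      sqe group             provide/remove via table
  SQEs per IO         2 (leader + member)   3 (add + consumer + remove)
  extra uring_enter   0                     2
  buffer cleanup      with group leader     extra cancel handling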
> the notion of a leader and so on. If we get rid of the need to impose
> more semantics onto it by mediating buffer transition through the
> table, I think we can do groups if needed but make it simpler.
The situation is just that the driver leases the buffer to io_uring; it
doesn't have to transfer ownership to io_uring. Once the buffer is added
to the table, it also has to be removed from the table.

It is just like a local variable vs. a global variable: the latter is
more complicated to use.
>
> > > What's preventing it from registering it in ->prep()? It would be a bit
> > > odd, but there would be nothing preventing it codewise, outside of the
> > > oddity of ->prep() not being idempotent at that point. Don't follow why
> > > that would be necessary, though, can you expand?
> >
> > ->prep() isn't exposed to uring cmd, and we may not want to bother
> > drivers with it.
> >
> > Also, removing the buffer still can't be done in ->prep().
> >
> > Without digging in further, one big thing could be that dependencies
> > aren't respected in ->prep().
>
> And we can just fix that and move the choosing of a buffer
> to ->issue(), in which case a buffer provided by one request
> will be observable to its linked requests.
This patch does import the buffer in ->issue(), as I explained to Jens:

- either all OPs are linked together with add_kbuf & remove_kbuf, and
then the OPs can't be issued concurrently

- or two syscalls are added for handling a single ublk IO

Neither is great from a performance viewpoint, and both complicate the
application. I don't think the above two can be avoided, or can you
explain how to do it? The group pattern is sketched below.
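
For illustration, the group pattern in liburing-style C. IOSQE_SQE_GROUP
is this series' flag (link-style semantics assumed: set on every SQE of
the group except the last); ublk_prep_lease_cmd() is a hypothetical
stand-in for the real uring_cmd setup, and how members reference the
leased buffer (patch 6) is elided:

/* One ublk uring_cmd leases the buffer as group leader, and N member
 * OPs consume it, all submitted with a single syscall.
 */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

ublk_prep_lease_cmd(sqe, ublk_fd, tag);		/* group leader */
sqe->flags |= IOSQE_SQE_GROUP;

for (int i = 0; i < n; i++) {			/* N concurrent members */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_send(sqe, sock_fd[i], NULL, io_len[i], 0);
	if (i + 1 < n)
		sqe->flags |= IOSQE_SQE_GROUP;	/* more members follow */
}

io_uring_submit(&ring);	/* one uring_enter for the whole group */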
thanks,
Ming