public inbox for [email protected]
 help / color / mirror / Atom feed
From: Pavel Begunkov <[email protected]>
To: Ming Lei <[email protected]>
Cc: Jens Axboe <[email protected]>,
	[email protected], [email protected],
	Kevin Wolf <[email protected]>
Subject: Re: [PATCH 2/9] io_uring: support user sqe ext flags
Date: Thu, 2 May 2024 15:22:10 +0100	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <ZjESSsaDQ7aOploz@fedora>

On 4/30/24 16:46, Ming Lei wrote:
> On Tue, Apr 30, 2024 at 03:10:01PM +0100, Pavel Begunkov wrote:
>> On 4/30/24 13:56, Ming Lei wrote:
>>> On Tue, Apr 30, 2024 at 01:00:30PM +0100, Pavel Begunkov wrote:
>>>> On 4/30/24 04:43, Ming Lei wrote:
>>>>> On Mon, Apr 29, 2024 at 04:24:54PM +0100, Pavel Begunkov wrote:
>>>>>> On 4/23/24 14:57, Ming Lei wrote:
>>>>>>> On Mon, Apr 22, 2024 at 12:16:12PM -0600, Jens Axboe wrote:
>>>>>>>> On 4/7/24 7:03 PM, Ming Lei wrote:
>>>>>>>>> sqe->flags is u8, and now we have used 7 bits, so take the last one for
>>>>>>>>> extending purpose.
>>>>>>>>>
>>>>>>>>> If bit7(IOSQE_HAS_EXT_FLAGS_BIT) is 1, it means this sqe carries ext flags
>>>>>>>>> from the last byte(.ext_flags), or bit23~bit16 of sqe->uring_cmd_flags for
>>>>>>>>> IORING_OP_URING_CMD.
>>>>>>>>>
>>>>>>>>> io_slot_flags() return value is converted to `ULL` because the affected bits
>>>>>>>>> are beyond 32bit now.
>>>>>>>>
>>>>>>>> If we're extending flags, which is something we arguably need to do at
>>>>>>>> some point, I think we should have them be generic and not spread out.
>>>>>>>
>>>>>>> Sorry, maybe I don't get your idea, and the ext_flag itself is always
>>>>>>> initialized in io_init_req(), like normal sqe->flags, same with its
>>>>>>> usage.
>>>>>>>
>>>>>>>> If uring_cmd needs specific flags and don't have them, then we should
>>>>>>>> add it just for that.
>>>>>>>
>>>>>>> The only difference is that bit23~bit16 of sqe->uring_cmd_flags is
>>>>>>> borrowed for uring_cmd's ext flags, because sqe byte0~47 have been taken,
>>>>>>> and can't be reused for generic flag. If we want to use byte48~63, it has
>>>>>>> to be overlapped with uring_cmd's payload, and it is one generic sqe
>>>>>>> flag, which is applied on uring_cmd too.
>>>>>>
>>>>>> Which is exactly the mess nobody would want to see. And I'd also
>>>>>
>>>>> The trouble is introduced by supporting uring_cmd, and solving it by setting
>>>>> ext flags for uring_cmd specially by liburing helper is still reasonable or
>>>>> understandable, IMO.
>>>>>
>>>>>> argue 8 extra bits is not enough anyway, otherwise the history will
>>>>>> repeat itself pretty soon
>>>>>
>>>>> It is started with 8 bits, now doubled when io_uring is basically
>>>>> mature, even though history might repeat, it will take much longer time
>>>>
>>>> You're mistaken, only 7 bits are taken not because there haven't been
>>>> ideas and need to use them, but because we're out of space and we've
>>>> been saving it for something that might be absolutely necessary.
>>>>
>>>> POLL_FIRST IMHO should've been a generic feature, but it worked around
>>>> being a send/recv specific flag, same goes for the use of registered
>>>> buffers, not to mention ideas for which we haven't had enough flag space.
>>>
>>> OK, but I am wondering why not extend flags a bit so that io_uring can
>>> become extendable, just like this patch.
>>
>> That would be great if can be done cleanly. Even having it
>> non contig with the first 8bits is fine, but not conditional
>> depending on opcode is too much.
> 
> byte56~63 is used for uring_cmd payload, and it can't be done without
> depending on uring_cmd op.
> 
> The patch is simple, and this usage can be well-documented. In
> userspace, just one special helper is needed for setting uring_cmd
> ext_flags only.

One simple helper here, one simple helper there, one line in man
in some other place, in the end it'll turn to be a horrible mess.

It's not even a question when we'd see people asking "I used
set_ext_flags but why it doesn't work" from people missing a
separate cmd flag. Or just to be sure they would call both,
such things happen even with more straightforward APIs, and it's
just one problem.

> Except for this simple way, I don't see other approaches to extend sqe flags.

Well, that's why I described below how exactly it can be done
cleanly in a long run.

>>>>>>> That is the only way I thought of, or any other suggestion for extending sqe
>>>>>>> flags generically?
>>>>>>
>>>>>> idea 1: just use the last bit. When we need another one it'd be time
>>>>>> to think about a long overdue SQE layout v2, this way we can try
>>>>>> to make flags u32 and clean up other problems.
>>>>>
>>>>> It looks over-kill to invent SQE v2 just for solving the trouble in
>>>>> uring_cmd, and supporting two layouts can be new trouble for io_uring.
>>>>
>>>> Sounds too uring_cmd centric, it's not specifically for uring_cmd, it's
>>>> just one of reasons. As for overkill, that's why I'm not telling you
>>>> to change the layour, but suggesting to take the last bit for the
>>>> group flag and leave future problems for the future.
>>>
>>> You mentioned 8bit flag is designed from beginning just for saving
>>> space, so SQE V2 may not help us at all.
>>
>> Not sure what you mean. Retrospectively speaking, u8 for flags was
>> an oversight
> 
> You mentioned that:
> 
> 	You're mistaken, only 7 bits are taken not because there haven't been
> 	ideas and need to use them, but because we're out of space and we've
> 	been saving it for something that might be absolutely necessary.
> 
> Nothing is changed since then, so where to find more free space from
> 64 bytes for sqe flags now?

ditto

>>> If the last bit can be reserved for extend flag, it is still possible
>>> to extend sqe flags a bit, such as this patch. Otherwise, we just lose
>>> chance to extend sqe flags in future.
>>
>> That's why I mentioned SQE layout v2, i.e. a ctx flag which reshuffles
>> sqe fields in a better way. Surely there will be a lot of headache with
>> such a migration, but you can make flags a u32 then if you find space
>> and wouldn't even need and extending flag.
> 
> It is one hard problem, and it may not be answered in short time, cause all
> use cases need to be covered, meantime 3 extra bytes are saved from the
> reshuffling, with alignment respected meantime.
> 
> Also it isn't worth of layout v2 just for extending sqe flags.

Not just, by opting to the pain of migration to a new SQE layout
we can revise more API decisions. It's a separate topic for
discussion, and the latter it's done the better because we'd
collect more design mistakes that can be fixed.

Reiterating again, you're not blocked by it.

>>> Jens, can you share your idea/option wrt. extending sqe flags?
>>>
>>>>
>>>>
>>>>> Also I doubt the problem can be solved in layout v2:
>>>>>
>>>>> - 64 byte is small enough to support everything, same for v2
>>>>>
>>>>> - uring_cmd has only 16 bytes payload, taking any byte from
>>>>> the payload may cause trouble for drivers
>>>>>
>>>>> - the only possible change could still be to suppress bytes for OP
>>>>> specific flags, but it might cause trouble for some OPs, such as
>>>>> network.
>>>>
>>>> Look up sqe's __pad1, for example
>>>
>>> Suppose it is just for uring_cmd, '__pad1' is shared with cmd_op, which is aligned
>>> with ioctl cmd and is supposed to be 32bit.
>>
>> It's not shared with cmd_op, it's in a struct with it, unless you
>> use a u32 part of ->addr2/off, it's just that, a completely
>> unnecessary created padding. There was also another field left,
>> at least in case for nvme.
> 
> OK, __pad1 is available for uring_cmd, and it could be better to use
> __pad1 for uring_cmd ext flags, but it still depends on uring_cmd, and
> now ext_flags can be u16 or more, :-)
> 
> Thanks for sharing this point.
> 
>>
>>> Same with 'off' which is used in rw at least, if sqe group is to be
>>> generic flag.
>>>
>>>>
>>>>
>>>>>> idea 2: the group assembling flag can move into cmds. Very roughly:
>>>>>>
>>>>>> io_cmd_init() {
>>>>>> 	ublk_cmd_init();
>>>>>> }
>>>>>>
>>>>>> ublk_cmd_init() {
>>>>>> 	io_uring_start_grouping(ctx, cmd);
>>>>>> }
>>>>>>
>>>>>> io_uring_start_grouping(ctx, cmd) {
>>>>>> 	ctx->grouping = true;
>>>>>> 	ctx->group_head = cmd->req;
>>>>>> }
>>>>>
>>>>> How can you know one group is starting without any flag? Or you still
>>>>> suggest the approach taken in fused command?
>>>>
>>>> That would be ublk's business, e.g. ublk or cmds specific flag
>>>
>>> Then it becomes dedicated fused command actually, and last year's main
>>> concern is that the approach isn't generic.
>>
>> My concern is anything leaking into hot paths, even if it's a
>> generic feature (and I wouldn't call it that). The question is
>> rather at what degree. I wouldn't call groups in isolation
>> without zc exciting, and making it to look like a generic feature
>> just for the sake of it might even be worse than having it opcode
>> specific.
>>
>> Regardless, this approach doesn't forbid some other opcode from
>> doing ctx->grouping = true based on some other opcode specific
>> flag, doesn't necessarily binds it to cmds/ublk.
> 
> Yes.
> 
>>
>>>>>> submit_sqe() {
>>>>>> 	if (ctx->grouping) {
>>>>>> 		link_to_group(req, ctx->group_head);
>>>>>> 		if (!(req->flags & REQ_F_LINK))
>>>>>> 			ctx->grouping = false;
>>>>>> 	}
>>>>>> }
>>>>>
>>>>> The group needs to be linked to existed link chain, so reusing REQ_F_LINK may
>>>>> not doable.
>>>>
>>>> Would it break zero copy feature if you cant?
>>>
>>> The whole sqe group needs to be linked to existed link chain, so we
>>> can't reuse REQ_F_LINK here.
>>
>> Why though? You're passing a buffer from the head to all group-linked
>> requests, how do normal links come into the picture?
> 
> For example of ublk-nbd, tcp send requests have to be linked, and each
> send request belongs to one group in case of zero copy.

"Linked" like in "add to a group"? Or in terms of traditional
IOSQE_IO_LINK links? Because if it's about groups you don't need
IOSQE_IO_LINK. And if it's IOSQE_IO_LINK linking, then same
is supposedly can be done in userspace.

> Meantime linking is one generic feature, and it can be used everywhere, so
> not good to disallow it in application

-- 
Pavel Begunkov

  reply	other threads:[~2024-05-02 14:21 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-08  1:03 [RFC PATCH 0/9] io_uring: support sqe group and provide group kbuf Ming Lei
2024-04-08  1:03 ` [PATCH 1/9] io_uring: net: don't check sqe->__pad2[0] for send zc Ming Lei
2024-04-08  1:03 ` [PATCH 2/9] io_uring: support user sqe ext flags Ming Lei
2024-04-22 18:16   ` Jens Axboe
2024-04-23 13:57     ` Ming Lei
2024-04-29 15:24       ` Pavel Begunkov
2024-04-30  3:43         ` Ming Lei
2024-04-30 12:00           ` Pavel Begunkov
2024-04-30 12:56             ` Ming Lei
2024-04-30 14:10               ` Pavel Begunkov
2024-04-30 15:46                 ` Ming Lei
2024-05-02 14:22                   ` Pavel Begunkov [this message]
2024-05-04  1:19                     ` Ming Lei
2024-04-08  1:03 ` [PATCH 3/9] io_uring: add helper for filling cqes in __io_submit_flush_completions() Ming Lei
2024-04-08  1:03 ` [PATCH 4/9] io_uring: add one output argument to io_submit_sqe Ming Lei
2024-04-08  1:03 ` [PATCH 5/9] io_uring: support SQE group Ming Lei
2024-04-22 18:27   ` Jens Axboe
2024-04-23 13:08     ` Kevin Wolf
2024-04-24  1:39       ` Ming Lei
2024-04-25  9:27         ` Kevin Wolf
2024-04-26  7:53           ` Ming Lei
2024-04-26 17:05             ` Kevin Wolf
2024-04-29  3:34               ` Ming Lei
2024-04-29 15:48         ` Pavel Begunkov
2024-04-30  3:07           ` Ming Lei
2024-04-29 15:32       ` Pavel Begunkov
2024-04-30  3:03         ` Ming Lei
2024-04-30 12:27           ` Pavel Begunkov
2024-04-30 15:00             ` Ming Lei
2024-05-02 14:09               ` Pavel Begunkov
2024-05-04  1:56                 ` Ming Lei
2024-05-02 14:28               ` Pavel Begunkov
2024-04-24  0:46     ` Ming Lei
2024-04-08  1:03 ` [PATCH 6/9] io_uring: support providing sqe group buffer Ming Lei
2024-04-08  1:03 ` [PATCH 7/9] io_uring/uring_cmd: support provide group kernel buffer Ming Lei
2024-04-08  1:03 ` [PATCH 8/9] ublk: support provide io buffer Ming Lei
2024-04-08  1:03 ` [RFC PATCH 9/9] liburing: support sqe ext_flags & sqe group Ming Lei
2024-04-19  0:55 ` [RFC PATCH 0/9] io_uring: support sqe group and provide group kbuf Ming Lei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox