From: Pavel Begunkov <[email protected]>
To: Josh Triplett <[email protected]>
Cc: Jens Axboe <[email protected]>,
[email protected],
"David S . Miller" <[email protected]>,
Jakub Kicinski <[email protected]>,
[email protected], [email protected],
Stefan Metzmacher <[email protected]>
Subject: Re: [PATCH v2 0/4] open/accept directly into io_uring fixed file table
Date: Sun, 15 Aug 2021 15:23:25 +0100 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On 8/15/21 11:48 AM, Pavel Begunkov wrote:
> On 8/15/21 4:31 AM, Josh Triplett wrote:
>> On Sat, Aug 14, 2021 at 01:50:24PM +0100, Pavel Begunkov wrote:
>>> On 8/13/21 8:00 PM, Josh Triplett wrote:
>>>> Rather than using sqe->file_index - 1, which feels like an error-prone
>>>> interface, I think it makes sense to use a dedicated flag for this, like
>>>> IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
>>>> including open, accept, and in the future many other operations such as
>>>> memfd_create. (Imagine using a single ring submission to open a memfd,
>>>> write a buffer into it, seal it, send it over a UNIX socket, and then
>>>> close it.)
>>>>
>>>> The only downside is that you'll need to reject that flag in all
>>>> non-open operations. One way to unify that code might be to add a flag
>>>> in io_op_def for open-like operations, and then check in common code for
>>>> the case of non-open-like operations passing IOSQE_OPEN_FIXED.
>>>
>>> io_uring is really thin, and so I absolutely don't want any extra
>>> overhead in the generic path, IOW anything affecting
>>> reads/writes/sends/recvs.
>>
>> There are already several checks for valid flags in io_init_req. For
>> instance:
>
> Yes, it's horrible and I don't want to make it any worse.
>
>> if ((sqe_flags & IOSQE_BUFFER_SELECT) &&
>> !io_op_defs[req->opcode].buffer_select)
>> return -EOPNOTSUPP;
>> It'd be trivial to make io_op_defs have a "valid flags" byte, and one
>> bitwise op tells you if any invalid flags were passed. *Zero* additional
>> overhead for other operations.
>
> Good point
>
>> Alternatively, since there are so few operations that open a file
>> descriptor, you could just add a separate opcode for those few
>> operations. That still seems preferable to overloading a 16-bit index
>> field for this.
>
> I don't think so
>
>> With this new mechanism, I think we're going to want to support more
>> than 65535 fixed-file entries. I can easily imagine wanting to handle
>> hundreds of thousands of files or sockets this way.
>
> May be. What I'm curious about is that the feature doesn't really
> change anything in this regard, but seems I haven't heard people
> asking for larger tables.
>
>>> The other reason is that there are only 2 bits left in sqe->flags,
>>> and we may use them for something better, considering that it's
>>> only open/accept and not much as this.
>>
>> pipe, dup3, socket, socketpair, pidfds (via either pidfd_open or a
>> ring-based spawn mechanism), epoll_create, inotify, fanotify, signalfd,
>> timerfd, eventfd, memfd_create, userfaultfd, open_tree, fsopen, fsmount,
>> memfd_secret.
>
> We could argue for many of those whether they should be in io_uring,
> and whether there are many benefits having them async and so. It would
> have another story if all the ecosystem was io_uring centric, but
> that's speculations.
>
>> Of those, I personally would *love* to have at least pipe, socket,
>> pidfd, memfd_create, and fsopen/fsmount/open_tree, plus some manner of
>> dup-like operation for moving things between the fixed-file table and
>> file descriptors.
>>
>> I think this is valuable and versatile enough to merit a flag. It would
>> also be entirely reasonable to create separate operations for these. But
>> either way, I don't think this should just be determined by whether a
>> 16-bit index is non-zero.
>>
>>> I agree that it feels error-prone, but at least it can be wrapped
>>> nicely enough in liburing, e.g.
>>>
>>> void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
>>> const char *path, int flags,
>>> mode_t mode, int slot_idx);
>>
>> That wrapper wouldn't be able to handle more than a 16-bit slot index
>> though.
>
> It would. Note, the index is "int" there, so if doesn't fit
> into u16, we can fail it. And do conversion if required.
>
>>>> Also, rather than using a 16-bit index for the fixed file table and
>>>> potentially requiring expansion into a different field in the future,
>>>> what about overlapping it with the nofile field in the open and accept
>>>> requests? If they're not opening a normal file descriptor, they don't
>>>> need nofile. And in the original sqe, you can then overlap it with a
>>>> 32-bit field like splice_fd_in.
>>>
>>> There is no nofile in SQEs, though
>>>
>>> req->open.nofile = rlimit(RLIMIT_NOFILE);
>>
>> nofile isn't needed for opening into the fixed-file table, so it could
>> be omitted in that case, and another field unioned with it.
>
> There is no problem to place it internally. Moreover, it's at the
> moment uniformly placed inside io_kiocb, but with nofile we'd need
> to find the place on per-op basis.
>
> Not like any matters, it's just bike shedding.
>
>> allow passing a 32-bit fixed-file index into open and accept without
>> growing the size of their structures. I think, with this new capability,
>> we're going to want a large number of fixed files available.
>>
>> In the SQE, you could overlap it with the splice_fd_in field, which
>> isn't needed by any calls other than splice.
>
> But it doesn't mean it won't be used, as happened with pretty every
> other field in SQE. So, it rather depends on what packing is wanted.
> And reusing almost never used ->buf_index (and potentially ->ioprio),
> sounds reasonable.
Aliasing with ->splice_fd_in looks better indeed (apart from it
inherently not being checked, but meh?), But I still don't think
it's a good option to use sqe->flags, and so still needs some way
to switch between modes.
Can be sqe->rw_flags as once was done with SPLICE_F_FD_IN_FIXED,
but it's IMHO an ugly hackish way. I still lean to the
0 vs >0 encoding .
--
Pavel Begunkov
next prev parent reply other threads:[~2021-08-15 14:24 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-08-13 16:43 [PATCH v2 0/4] open/accept directly into io_uring fixed file table Pavel Begunkov
2021-08-13 16:43 ` [PATCH v2 1/4] net: add accept helper not installing fd Pavel Begunkov
2021-08-13 16:43 ` [PATCH v2 2/4] io_uring: openat directly into fixed fd table Pavel Begunkov
2021-08-13 16:43 ` [PATCH v2 3/4] io_uring: hand code io_accept() fd installing Pavel Begunkov
2021-08-13 16:43 ` [PATCH v2 4/4] io_uring: accept directly into fixed file table Pavel Begunkov
2021-08-13 19:00 ` [PATCH v2 0/4] open/accept directly into io_uring " Josh Triplett
2021-08-14 12:50 ` Pavel Begunkov
2021-08-14 23:03 ` Jens Axboe
2021-08-15 3:42 ` Josh Triplett
2021-08-15 15:05 ` Jens Axboe
2021-08-15 15:12 ` Jens Axboe
2021-08-15 13:00 ` Pavel Begunkov
2021-08-15 3:31 ` Josh Triplett
2021-08-15 10:48 ` Pavel Begunkov
2021-08-15 14:23 ` Pavel Begunkov [this message]
2021-08-16 15:45 ` Stefan Metzmacher
2021-08-17 9:33 ` Pavel Begunkov
2021-08-17 14:57 ` Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox