From: Pavel Begunkov <[email protected]>
To: Bijan Mottahedeh <[email protected]>, [email protected]
Cc: [email protected]
Subject: Re: [PATCH 7/8] io_uring: support readv/writev with fixed buffers
Date: Wed, 18 Nov 2020 09:14:17 +0000 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On 17/11/2020 22:59, Bijan Mottahedeh wrote:
>>> Support readv/writev with fixed buffers, and introduce IOSQE_FIXED_BUFFER,
>>> consistent with fixed files.
>>
>> I don't like it at all, see issues below. The actual implementation would
>> be much uglier.
>>
>> I propose you to split the series and push separately. Your first 6 patches
>> first, I don't have conceptual objections to them. Then registration sharing
>> (I still need to look it up). And then we can return to this, if you're not
>> yet convinced.
>
> Ok. The sharing patch is actually the highest priority for us so it'd be great to know if you think it's in the right direction.
>
> Should I submit them as they are or address your fixed_file_ref comments in Patch 4/8 as well? Would I need your prep patch beforehand?
I remembered one more thing that I need to do for your patches to work.
I'll ping you afterwards
>
>>> +static ssize_t io_import_iovec_fixed(int rw, struct io_kiocb *req, void *buf,
>>> + unsigned segs, unsigned fast_segs,
>>> + struct iovec **iovec,
>>> + struct iov_iter *iter)
>>> +{
>>> + struct io_ring_ctx *ctx = req->ctx;
>>> + struct io_mapped_ubuf *imu;
>>> + struct iovec *iov;
>>> + u16 index, buf_index;
>>> + ssize_t ret;
>>> + unsigned long seg;
>>> +
>>> + if (unlikely(!ctx->buf_data))
>>> + return -EFAULT;
>>> +
>>> + ret = import_iovec(rw, buf, segs, fast_segs, iovec, iter);
>>
>> Did you test it? import_iovec() does access_ok() against each iov_base,
>> which in your case are an index.
>
> I used liburing:test/file-{register,update} as models for the equivalent buffer tests and they seem to work. I can send out the tests and the liburing changes if you want.
Hmm, seems access_ok() is no-op for many architectures.
Still IIRC it's not portable.
>
> The fixed io test registers an empty iov table first:
>
> ret = io_uring_register_buffers(ring, iovs, UIO_MAXIOV);
>
> It next updates the table with two actual buffers at offset 768:
>
> ret = io_uring_register_buffers_update(ring, 768, ups, 2);
>
> It later uses the buffer at index 768 for writev similar to the file-register test (IOSQE_FIXED_BUFFER instead of IOSQE_FIXED_FILE):
>
> iovs[768].iov_base = (void *)768;
> iovs[768].iov_len = pagesize;
>
>
> io_uring_prep_writev(sqe, fd, iovs, 1, 0);
> sqe->flags |= IOSQE_FIXED_BUFFER;
>
> ret = io_uring_submit(ring);
>
> Below is the iovec returned from
>
> io_import_iovec_fixed()
> -> io_import_vec()
>
> {iov_base = 0x300 <dm_early_create+412>, iov_len = 4096}
>
>>> + if (ret < 0)
>>> + return ret;
>>> +
>>> + iov = (struct iovec *)iter->iov;
>>> +
>>> + for (seg = 0; seg < iter->nr_segs; seg++) {
>>> + buf_index = *(u16 *)(&iov[seg].iov_base);
>>
>> That's ugly, and also not consistent with rw_fixed, because iov_base is
>> used to calculate offset.
>
> Would offset be applicable when using readv/writev?
Not sure what you mean. You can register a huge chunk of memory,
but with buf_index alone you can't select a subchunk. E.g.
void *buf = alloc(4G);
idx = io_uring_register_buf(buf);
uring_read_fixed(reg_buf=idx,
data=buf+100, // offset 100
size=sz);
This writes [buf+100, buf+100+sz]. Without passing @buf you
wouldn't be able to specify an offset. Alternatively, the
API could have been accepting an offset directly, but this
option was chosen.
There is no such a problem with non-registered versions, it
just writes/reads all iov.
>
> My thinkig was that for those cases, each iovec should be used exactly as registered.
It may be confusing, but read/write_fixed use iov_base to calculate offset.
i.e.
imu = ctx->bufs[req->buffer_index];
if (iov_base not in range(imu->original_addr, imu->len))
fail;
offset = iov_base - imu->original_addr;
>
>>
>>> + if (unlikely(buf_index < 0 || buf_index >= ctx->nr_user_bufs))
>>> + return -EFAULT;
>>> +
>>> + index = array_index_nospec(buf_index, ctx->nr_user_bufs);
>>> + imu = io_buf_from_index(ctx, index);
>>> + if (!imu->ubuf || !imu->len)
>>> + return -EFAULT;
>>> + if (iov[seg].iov_len > imu->len)
>>> + return -EFAULT;
>>> +
>>> + iov[seg].iov_base = (void *)imu->ubuf;
>>
>> Nope, that's not different from non registered version.
>> What import_fixed actually do is setting up the iter argument to point
>> to a bvec (a vector of struct page *).
>
> So in fact, the buffers end up being pinned again because they are not being as bvecs?
right, it just passes iov with userspace virtual addresses in your case
and layer below don't know that they're pinned. And as they're virtual
in most cases they have to be translated to physical (that's solved by
having a vector of pages).
>
>>
>> So it either would need to keep a vector of bvecs, that's a vector of vectors,
>> that's not supported by iter, etc., so you'll also need to iterate over them
>> in io_read/write and so on. Or flat 2D structure into 1D, but that's still ugly.
>
> So you're saying there's no clean way to create a readv/writev + fixed buffers API? It would've been nice to have a consistent API between files and buffers.
I guess you can register an iov as a single bvec by flatting out the structure
to a single vector (bvec). Though, we'd need to drop an assumption that all
but first and last entries are page sized.
Or do that online with extra overhead + allocations.
>
>
>>> @@ -5692,7 +5743,7 @@ static int io_timeout_remove_prep(struct io_kiocb *req,
>>> {
>>> if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
>>> return -EINVAL;
>>> - if (unlikely(req->flags & (REQ_F_FIXED_FILE | REQ_F_BUFFER_SELECT)))
>>> + if (unlikely(req->flags & (REQ_F_FIXED_RSRC | REQ_F_BUFFER_SELECT)))
>>
>> Why it's here?
>>
>> #define REQ_F_FIXED_RSRC (REQ_F_FIXED_FILE | REQ_F_FIXED_BUFFER)
>> So, why do you | with REQ_F_BUFFER_SELECT again here?
>
> I thought to group fixed files/buffers but distinguish them from selected buffers.
Ah, I've got this one wrong.
>
>>> @@ -87,6 +88,8 @@ enum {
>>> #define IOSQE_ASYNC (1U << IOSQE_ASYNC_BIT)
>>> /* select buffer from sqe->buf_group */
>>> #define IOSQE_BUFFER_SELECT (1U << IOSQE_BUFFER_SELECT_BIT)
>>> +/* use fixed buffer set */
>>> +#define IOSQE_FIXED_BUFFER (1U << IOSQE_FIXED_BUFFER_BIT)
>>
>> Unfortenatuly, we're almost out of flags bits -- it's a 1 byte
>> field and 6 bits are already taken. Let's not use it.
--
Pavel Begunkov
next prev parent reply other threads:[~2020-11-18 9:17 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-11-12 23:00 [PATCH 0/8] io_uring: buffer registration enhancements Bijan Mottahedeh
2020-11-12 23:00 ` [PATCH 1/8] io_uring: modularize io_sqe_buffer_register Bijan Mottahedeh
2020-11-12 23:00 ` [PATCH 2/8] io_uring: modularize io_sqe_buffers_register Bijan Mottahedeh
2020-11-12 23:00 ` [PATCH 3/8] io_uring: generalize fixed file functionality Bijan Mottahedeh
2020-11-12 23:00 ` [PATCH 4/8] io_uring: implement fixed buffers registration similar to fixed files Bijan Mottahedeh
2020-11-15 13:33 ` Pavel Begunkov
2020-11-16 21:24 ` Bijan Mottahedeh
2020-11-16 23:09 ` Pavel Begunkov
2020-11-17 0:41 ` Bijan Mottahedeh
2020-11-12 23:00 ` [PATCH 5/8] io_uring: generalize files_update functionlity to rsrc_update Bijan Mottahedeh
2020-11-12 23:00 ` [PATCH 6/8] io_uring: support buffer registration updates Bijan Mottahedeh
2020-11-18 20:17 ` Pavel Begunkov
2020-12-09 0:42 ` Bijan Mottahedeh
2020-11-12 23:00 ` [PATCH 7/8] io_uring: support readv/writev with fixed buffers Bijan Mottahedeh
2020-11-17 11:04 ` Pavel Begunkov
2020-11-17 22:59 ` Bijan Mottahedeh
2020-11-18 9:14 ` Pavel Begunkov [this message]
2020-11-18 20:12 ` Pavel Begunkov
[not found] ` <[email protected]>
[not found] ` <[email protected]>
2020-11-19 19:27 ` Bijan Mottahedeh
2020-11-12 23:00 ` [PATCH 8/8] io_uring: support buffer registration sharing Bijan Mottahedeh
2020-11-16 23:28 ` [PATCH 0/8] io_uring: buffer registration enhancements Pavel Begunkov
2020-11-17 0:21 ` Bijan Mottahedeh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox