public inbox for [email protected]
 help / color / mirror / Atom feed
From: Jens Axboe <[email protected]>
To: Ming Lei <[email protected]>
Cc: Pavel Begunkov <[email protected]>,
	[email protected], [email protected],
	Uday Shankar <[email protected]>,
	Akilesh Kailash <[email protected]>
Subject: Re: [PATCH V8 0/8] io_uring: support sqe group and leased group kbuf
Date: Tue, 29 Oct 2024 20:43:39 -0600	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <ZyGT3h5jNsKB0mrZ@fedora>

On 10/29/24 8:03 PM, Ming Lei wrote:
> On Tue, Oct 29, 2024 at 03:26:37PM -0600, Jens Axboe wrote:
>> On 10/29/24 2:06 PM, Jens Axboe wrote:
>>> On 10/29/24 1:18 PM, Jens Axboe wrote:
>>>> Now, this implementation requires a user buffer, and as far as I'm told,
>>>> you currently have kernel buffers on the ublk side. There's absolutely
>>>> no reason why kernel buffers cannot work, we'd most likely just need to
>>>> add a IORING_RSRC_KBUFFER type to handle that. My question here is how
>>>> hard is this requirement? Reason I ask is that it's much simpler to work
>>>> with userspace buffers. Yes the current implementation maps them
>>>> everytime, we could certainly change that, however I don't see this
>>>> being an issue. It's really no different than O_DIRECT, and you only
>>>> need to map them once for a read + whatever number of writes you'd need
>>>> to do. If a 'tag' is provided for LOCAL_BUF, it'll post a CQE whenever
>>>> that buffer is unmapped. This is a notification for the application that
>>>> it's done using the buffer. For a pure kernel buffer, we'd either need
>>>> to be able to reference it (so that we KNOW it's not going away) and/or
>>>> have a callback associated with the buffer.
>>>
>>> Just to expand on this - if a kernel buffer is absolutely required, for
>>> example if you're inheriting pages from the page cache or other
>>> locations you cannot control, we would need to add something ala the
>>> below:
>>
>> Here's a more complete one, but utterly untested. But it does the same
>> thing, mapping a struct request, but it maps it to an io_rsrc_node which
>> in turn has an io_mapped_ubuf in it. Both BUFFER and KBUFFER use the
>> same type, only the destruction is different. Then the callback provided
>> needs to do something ala:
>>
>> struct io_mapped_ubuf *imu = node->buf;
>>
>> if (imu && refcount_dec_and_test(&imu->refs))
>> 	kvfree(imu);
>>
>> when it's done with the imu. Probably an rsrc helper should just be done
>> for that, but those are details.
>>
>> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
>> index 9621ba533b35..050868a4c9f1 100644
>> --- a/io_uring/rsrc.c
>> +++ b/io_uring/rsrc.c
>> @@ -8,6 +8,8 @@
>>  #include <linux/nospec.h>
>>  #include <linux/hugetlb.h>
>>  #include <linux/compat.h>
>> +#include <linux/bvec.h>
>> +#include <linux/blk-mq.h>
>>  #include <linux/io_uring.h>
>>  
>>  #include <uapi/linux/io_uring.h>
>> @@ -474,6 +476,9 @@ void io_free_rsrc_node(struct io_rsrc_node *node)
>>  		if (node->buf)
>>  			io_buffer_unmap(node->ctx, node);
>>  		break;
>> +	case IORING_RSRC_KBUFFER:
>> +		node->kbuf_fn(node);
>> +		break;
> 
> Here 'node' is freed later, and it may not work because ->imu is bound
> with node.

Not sure why this matters? imu can be bound to any node (and has a
separate ref), but the node will remain for as long as the submission
runs. It has to, because the last reference is put when submission of
all requests in that series ends.

>> @@ -1070,6 +1075,65 @@ int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg)
>>  	return ret;
>>  }
>>  
>> +struct io_rsrc_node *io_rsrc_map_request(struct io_ring_ctx *ctx,
>> +					 struct request *req,
>> +					 void (*kbuf_fn)(struct io_rsrc_node *))
>> +{
>> +	struct io_mapped_ubuf *imu = NULL;
>> +	struct io_rsrc_node *node = NULL;
>> +	struct req_iterator rq_iter;
>> +	unsigned int offset;
>> +	struct bio_vec bv;
>> +	int nr_bvecs;
>> +
>> +	if (!bio_has_data(req->bio))
>> +		goto out;
>> +
>> +	nr_bvecs = 0;
>> +	rq_for_each_bvec(bv, req, rq_iter)
>> +		nr_bvecs++;
>> +	if (!nr_bvecs)
>> +		goto out;
>> +
>> +	node = io_rsrc_node_alloc(ctx, IORING_RSRC_KBUFFER);
>> +	if (!node)
>> +		goto out;
>> +	node->buf = NULL;
>> +
>> +	imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_NOIO);
>> +	if (!imu)
>> +		goto out;
>> +
>> +	imu->ubuf = 0;
>> +	imu->len = 0;
>> +	if (req->bio != req->biotail) {
>> +		int idx = 0;
>> +
>> +		offset = 0;
>> +		rq_for_each_bvec(bv, req, rq_iter) {
>> +			imu->bvec[idx++] = bv;
>> +			imu->len += bv.bv_len;
>> +		}
>> +	} else {
>> +		struct bio *bio = req->bio;
>> +
>> +		offset = bio->bi_iter.bi_bvec_done;
>> +		imu->bvec[0] = *__bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
>> +		imu->len = imu->bvec[0].bv_len;
>> +	}
>> +	imu->nr_bvecs = nr_bvecs;
>> +	imu->folio_shift = PAGE_SHIFT;
>> +	refcount_set(&imu->refs, 1);
> 
> One big problem is how to initialize the reference count, because this
> buffer need to be used in the following more than one request. Without
> one perfect counter, the buffer won't be freed in the exact time without
> extra OP.

Each request that uses the node, will grab a reference to the node. The
node holds a reference to the buffer. So at least as the above works,
the buf will be put when submission ends, as that puts the node and
subsequently the one reference the imu has by default. It'll outlast any
of the requests that use it during submission, and there cannot be any
other users of it as it isn't discoverable outside of that.

> I think the reference should be in `node` which need to be live if any
> consumer OP isn't completed.

That is how it works... io_req_assign_rsrc_node() will assign a node to
a request, which will be there until the request completes.

>> +	node->buf = imu;
>> +	node->kbuf_fn = kbuf_fn;
>> +	return node;
> 
> Also this function needs to register the buffer to table with one
> pre-defined buf index, then the following request can use it by
> the way of io_prep_rw_fixed().

It should not register it with the table, the whole point is to keep
this node only per-submission discoverable. If you're grabbing random
request pages, then it very much is a bit finicky and needs to be of
limited scope.

Each request type would need to support it. For normal read/write, I'd
suggest just adding IORING_OP_READ_LOCAL and WRITE_LOCAL to do that.

> If OP dependency can be avoided, I think this approach is fine,
> otherwise I still suggest sqe group. Not only performance, but
> application becomes too complicated.

You could avoid the OP dependency with just a flag, if you really wanted
to. But I'm not sure it makes a lot of sense. And it's a hell of a lot
simpler than the sqe group scheme, which I'm a bit worried about as it's
a bit complicated in how deep it needs to go in the code. This one
stands alone, so I'd strongly encourage we pursue this a bit further and
iron out the kinks. Maybe it won't work in the end, I don't know, but it
seems pretty promising and it's soooo much simpler.

> We also we need to provide ->prep() callback for uring_cmd driver, so
> that io_rsrc_map_request() can be called by driver in ->prep(),
> meantime `io_ring_ctx` and `io_rsrc_node` need to be visible for driver.
> What do you think of these kind of changes?

io_ring_ctx is already visible in the normal system headers,
io_rsrc_node we certainly could make visible. That's not a big deal. It
makes a lot more sense to export than some of the other stuff we have in
there! As long as it's all nicely handled by helpers, then we'd be fine.

-- 
Jens Axboe

  reply	other threads:[~2024-10-30  2:43 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-25 12:22 [PATCH V8 0/8] io_uring: support sqe group and leased group kbuf Ming Lei
2024-10-25 12:22 ` [PATCH V8 1/7] io_uring: add io_link_req() helper Ming Lei
2024-10-25 12:22 ` [PATCH V8 2/7] io_uring: add io_submit_fail_link() helper Ming Lei
2024-10-25 12:22 ` [PATCH V8 3/7] io_uring: add helper of io_req_commit_cqe() Ming Lei
2024-10-25 12:22 ` [PATCH V8 4/7] io_uring: support SQE group Ming Lei
2024-10-29  0:12   ` Jens Axboe
2024-10-29  1:50     ` Ming Lei
2024-10-29 16:38       ` Pavel Begunkov
2024-10-25 12:22 ` [PATCH V8 5/7] io_uring: support leased group buffer with REQ_F_GROUP_KBUF Ming Lei
2024-10-29 16:47   ` Pavel Begunkov
2024-10-30  0:45     ` Ming Lei
2024-10-30  1:25       ` Pavel Begunkov
2024-10-30  2:04         ` Ming Lei
2024-10-25 12:22 ` [PATCH V8 6/7] io_uring/uring_cmd: support leasing device kernel buffer to io_uring Ming Lei
2024-10-25 12:22 ` [PATCH V8 7/7] ublk: support leasing io " Ming Lei
2024-10-29 17:01 ` [PATCH V8 0/8] io_uring: support sqe group and leased group kbuf Pavel Begunkov
2024-10-29 17:04   ` Jens Axboe
2024-10-29 19:18     ` Jens Axboe
2024-10-29 20:06       ` Jens Axboe
2024-10-29 21:26         ` Jens Axboe
2024-10-30  2:03           ` Ming Lei
2024-10-30  2:43             ` Jens Axboe [this message]
2024-10-30  3:08               ` Ming Lei
2024-10-30  4:11                 ` Ming Lei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox