From: Hao Xu <[email protected]>
To: Jens Axboe <[email protected]>, [email protected]
Cc: [email protected], Dylan Yudaken <[email protected]>
Subject: Re: [PATCH 3/3] io_uring: add support for ring mapped supplied buffers
Date: Wed, 18 May 2022 18:50:42 +0800 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On 5/17/22 23:46, Jens Axboe wrote:
> On 5/17/22 8:18 AM, Hao Xu wrote:
>> Hi All,
>>
>> On 5/17/22 00:21, Jens Axboe wrote:
>>> Provided buffers allow an application to supply io_uring with buffers
>>> that can then be grabbed for a read/receive request, when the data
>>> source is ready to deliver data. The existing scheme relies on using
>>> IORING_OP_PROVIDE_BUFFERS to do that, but it can be difficult to use
>>> in real world applications. It's pretty efficient if the application
>>> is able to supply back batches of provided buffers when they have been
>>> consumed and the application is ready to recycle them, but if
>>> fragmentation occurs in the buffer space, it can become difficult to
>>> supply enough buffers at a time. This hurts efficiency.
>>>
>>> Add a register op, IORING_REGISTER_PBUF_RING, which allows an application
>>> to setup a shared queue for each buffer group of provided buffers. The
>>> application can then supply buffers simply by adding them to this ring,
>>> and the kernel can consume them just as easily. The ring shares the head
>>> with the application, the tail remains private in the kernel.
>>>
>>> Provided buffers setup with IORING_REGISTER_PBUF_RING cannot use
>>> IORING_OP_{PROVIDE,REMOVE}_BUFFERS for adding or removing entries to the
>>> ring, they must use the mapped ring. Mapped provided buffer rings can
>>> co-exist with normal provided buffers, just not within the same group ID.
>>>
>>> To gauge overhead of the existing scheme and evaluate the mapped ring
>>> approach, a simple NOP benchmark was written. It uses a ring of 128
>>> entries, and submits/completes 32 at a time. 'Replenish' is how
>>> many buffers are provided back at a time after they have been
>>> consumed:
>>>
>>> Test                     Replenish   NOPs/sec
>>> ================================================================
>>> No provided buffers      NA          ~30M
>>> Provided buffers         32          ~16M
>>> Provided buffers         1           ~10M
>>> Ring buffers             32          ~27M
>>> Ring buffers             1           ~27M
>>>
>>> The ring mapped buffers perform almost as well as not using provided
>>> buffers at all, and they don't care if you provide 1 or more back at
>>> the same time. This means applications can just replenish as they go,
>>> rather than needing to batch and compact, further reducing overhead in the
>>> application. The NOP benchmark above doesn't need to do any compaction,
>>> so that overhead isn't even reflected in the above test.
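To make the scheme above concrete, the shared ring can be modeled in plain C: the application is the producer (it fills a slot and bumps a shared index), the kernel the consumer. This is only an illustrative sketch; the names buf_ring, ring_add and ring_get are invented here, and the real UAPI uses struct io_uring_buf entries plus proper release/acquire ordering on the shared index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Invented names: a minimal single-threaded model of the shared
 * provided-buffer ring. The entry count must be a power of 2. */
struct buf_entry {
	uint64_t addr;
	uint32_t len;
	uint16_t bid;
};

struct buf_ring {
	struct buf_entry *bufs;
	unsigned mask;	/* nr_entries - 1 */
	unsigned tail;	/* advanced by the application (producer) */
	unsigned head;	/* advanced by the kernel (consumer) */
};

/* Application side: recycle one buffer by appending it to the ring. */
static void ring_add(struct buf_ring *br, void *addr, uint32_t len, uint16_t bid)
{
	struct buf_entry *e = &br->bufs[br->tail & br->mask];

	e->addr = (uint64_t)(uintptr_t)addr;
	e->len = len;
	e->bid = bid;
	br->tail++;	/* the real ring publishes this with a store-release */
}

/* Kernel side: grab the next available buffer, or NULL if empty. */
static struct buf_entry *ring_get(struct buf_ring *br)
{
	if (br->head == br->tail)
		return NULL;
	return &br->bufs[br->head++ & br->mask];
}
```

Because replenishing is one slot write plus an index bump, supplying buffers back one at a time costs essentially the same as batching 32, which is what the benchmark table above shows.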
>>>
>>> Co-developed-by: Dylan Yudaken <[email protected]>
>>> Signed-off-by: Jens Axboe <[email protected]>
>>> ---
>>>  fs/io_uring.c                 | 233 ++++++++++++++++++++++++++++++++--
>>>  include/uapi/linux/io_uring.h |  36 ++++++
>>>  2 files changed, 257 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 5867dcabc73b..776a9f5e5ec7 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -285,9 +285,26 @@ struct io_rsrc_data {
>>> bool quiesce;
>>> };
>>> +#define IO_BUFFER_LIST_BUF_PER_PAGE (PAGE_SIZE / sizeof(struct io_uring_buf))
>>> struct io_buffer_list {
>>> - struct list_head buf_list;
>>> + /*
>>> + * If ->buf_nr_pages is set, then buf_pages/buf_ring are used. If not,
>>> + * then these are classic provided buffers and ->buf_list is used.
>>> + */
>>> + union {
>>> + struct list_head buf_list;
>>> + struct {
>>> + struct page **buf_pages;
>>> + struct io_uring_buf_ring *buf_ring;
>>> + };
>>> + };
>>> __u16 bgid;
>>> +
>>> + /* below is for ring provided buffers */
>>> + __u16 buf_nr_pages;
>>> + __u16 nr_entries;
>>> + __u32 tail;
>>> + __u32 mask;
>>> };
>>> struct io_buffer {
>>> @@ -804,6 +821,7 @@ enum {
>>> REQ_F_NEED_CLEANUP_BIT,
>>> REQ_F_POLLED_BIT,
>>> REQ_F_BUFFER_SELECTED_BIT,
>>> + REQ_F_BUFFER_RING_BIT,
>>> REQ_F_COMPLETE_INLINE_BIT,
>>> REQ_F_REISSUE_BIT,
>>> REQ_F_CREDS_BIT,
>>> @@ -855,6 +873,8 @@ enum {
>>> REQ_F_POLLED = BIT(REQ_F_POLLED_BIT),
>>> /* buffer already selected */
>>> REQ_F_BUFFER_SELECTED = BIT(REQ_F_BUFFER_SELECTED_BIT),
>>> + /* buffer selected from ring, needs commit */
>>> + REQ_F_BUFFER_RING = BIT(REQ_F_BUFFER_RING_BIT),
>>> /* completion is deferred through io_comp_state */
>>> REQ_F_COMPLETE_INLINE = BIT(REQ_F_COMPLETE_INLINE_BIT),
>>> /* caller should reissue async */
>>> @@ -979,6 +999,12 @@ struct io_kiocb {
>>> /* stores selected buf, valid IFF REQ_F_BUFFER_SELECTED is set */
>>> struct io_buffer *kbuf;
>>> +
>>> + /*
>>> + * stores buffer ID for ring provided buffers, valid IFF
>>> + * REQ_F_BUFFER_RING is set.
>>> + */
>>> + struct io_buffer_list *buf_list;
>>> };
>>> union {
>>> @@ -1470,8 +1496,14 @@ static inline void io_req_set_rsrc_node(struct io_kiocb *req,
>>> static unsigned int __io_put_kbuf(struct io_kiocb *req, struct list_head *list)
>>> {
>>> - req->flags &= ~REQ_F_BUFFER_SELECTED;
>>> - list_add(&req->kbuf->list, list);
>>> + if (req->flags & REQ_F_BUFFER_RING) {
>>> + if (req->buf_list)
>>> + req->buf_list->tail++;
>>
>> This confused me for some time.. it seems [tail, head) is the range of
>> registered bufs that the kernel can consume, similar to what the pipe
>> logic does? How about swapping the names of head and tail, making the
>> kernel the consumer. But this is just my personal preference..
>
> Agreed, I'll make that change. That matches the sq ring as well, which
> is the same user producer, kernel consumer setup.
>
>>> +	tail &= bl->mask;
>>> +	if (tail < IO_BUFFER_LIST_BUF_PER_PAGE) {
>>> +		buf = &br->bufs[tail];
>>> +	} else {
>>> +		int off = tail & (IO_BUFFER_LIST_BUF_PER_PAGE - 1);
>>> +		int index = tail / IO_BUFFER_LIST_BUF_PER_PAGE - 1;
>>
>> Could we use a bitwise trick with a compile-time check there, since
>> IO_BUFFER_LIST_BUF_PER_PAGE is a power of 2?
>
> This is known at compile time, so the compiler should already be doing
> that as it's a constant.
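As an illustration (assuming a 4096-byte page and a 16-byte struct io_uring_buf, so 256 entries per page, numbers which are assumptions here): division and modulo by a power-of-2 compile-time constant are equivalent to a shift and a mask, and compilers perform this strength reduction automatically for constant divisors.

```c
#include <assert.h>

/* Illustrative numbers: 4096-byte page / 16-byte entry = 256 per page. */
enum { BUF_PER_PAGE = 4096 / 16 };	/* a power of 2 */

/* What the source writes ... */
static unsigned idx_div(unsigned tail) { return tail / BUF_PER_PAGE; }
static unsigned off_mod(unsigned tail) { return tail % BUF_PER_PAGE; }

/* ... and the equivalent shift/mask the compiler emits (log2(256) == 8). */
static unsigned idx_shift(unsigned tail) { return tail >> 8; }
static unsigned off_mask(unsigned tail)  { return tail & (BUF_PER_PAGE - 1); }
```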
>
>>> +		buf = page_address(bl->buf_pages[index]);
>>> +		buf += off;
>>> +	}
>>
>> I'm not familiar with this part, so allow me to ask: is this if/else
>> statement for efficiency? Why choose one page as the dividing line?
>
> We need to index at the right page granularity.
Sorry, I didn't get it. Why can't we just do buf = &br->bufs[tail]?
It seems something here is beyond my knowledge..
>
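One way to picture the indexing in question (a simplified standalone model, not the patch's code, and without its first-page fast path): the kernel reaches the ring through an array of per-page pointers, and those pinned pages are not guaranteed to be virtually contiguous in the kernel, so &br->bufs[tail] can only be trusted within the first page; beyond it, the right page must be picked first. Page and entry sizes below are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative entry layout: 16 bytes, as in the UAPI hunks above. */
struct entry {
	uint64_t addr;
	uint32_t len;
	uint16_t bid;
	uint16_t resv;
};

#define PAGE_SZ  4096
#define PER_PAGE (PAGE_SZ / sizeof(struct entry))	/* 256 entries */

/* The pages in 'pages' are allocated separately, standing in for the
 * kernel's struct page array: no single contiguous mapping exists, so
 * the page is selected first, then the offset within it. */
static struct entry *ring_entry(void **pages, unsigned tail)
{
	unsigned index = tail / PER_PAGE;	/* which page */
	unsigned off = tail & (PER_PAGE - 1);	/* entry within that page */

	return (struct entry *)pages[index] + off;
}
```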