* Large CQE for fuse headers @ 2024-10-10 20:56 Bernd Schubert 2024-10-11 17:57 ` Jens Axboe ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Bernd Schubert @ 2024-10-10 20:56 UTC (permalink / raw) To: io-uring Cc: Pavel Begunkov, Jens Axboe, Miklos Szeredi, Joanne Koong, Josef Bacik Hello, as discussed during LPC, we would like to have large CQE sizes, at least 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... Pavel said that this should be ok, but it would be better to have the CQE size as function argument. Could you give me some hints how this should look like and especially how we are going to communicate the CQE size to the kernel? I guess just adding IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. I'm basically through with other changes Miklos had been asking for and moving fuse headers into the CQE is next. Thanks, Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-10 20:56 Large CQE for fuse headers Bernd Schubert @ 2024-10-11 17:57 ` Jens Axboe 2024-10-11 18:35 ` Bernd Schubert 2024-10-11 21:38 ` Pavel Begunkov 2024-10-12 1:55 ` Ming Lei 2 siblings, 1 reply; 23+ messages in thread From: Jens Axboe @ 2024-10-11 17:57 UTC (permalink / raw) To: Bernd Schubert, io-uring Cc: Pavel Begunkov, Miklos Szeredi, Joanne Koong, Josef Bacik On 10/10/24 2:56 PM, Bernd Schubert wrote: > Hello, > > as discussed during LPC, we would like to have large CQE sizes, at least > 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... > > Pavel said that this should be ok, but it would be better to have the CQE > size as function argument. > Could you give me some hints how this should look like and especially how > we are going to communicate the CQE size to the kernel? I guess just adding > IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. Not Pavel and unfortunately I could not be at that LPC discussion, but yeah I don't see why just adding the necessary SETUP arg for this wouldn't be the way to go. As long as they are power-of-2, then all it'll impact on both the kernel and liburing side is what size shift to use when iterating CQEs. Since this obviously means larger CQ rings, one nice side effect is that since 6.10 we don't need contig pages to map any of the rings. So should work just fine regardless of memory fragmentation, where previously that would've been a concern. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
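For context, the existing big-CQE support is already just a setup flag plus a size shift, so a hypothetical IORING_SETUP_CQE256 would follow the same pattern. A minimal liburing sketch of today's CQE32 usage (the 256B flag itself does not exist yet):

#include <liburing.h>

/* minimal sketch: a ring whose CQEs are 32 bytes instead of 16; with
 * IORING_SETUP_CQE32 the 16 extra bytes of each completion are exposed as
 * cqe->big_cqe[0..1], and a bigger CQE size would extend that trailing area */
static int setup_cqe32_ring(struct io_uring *ring)
{
        struct io_uring_params p = { .flags = IORING_SETUP_CQE32 };

        return io_uring_queue_init_params(64, ring, &p);
}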
* Re: Large CQE for fuse headers 2024-10-11 17:57 ` Jens Axboe @ 2024-10-11 18:35 ` Bernd Schubert 2024-10-11 18:39 ` Jens Axboe 0 siblings, 1 reply; 23+ messages in thread From: Bernd Schubert @ 2024-10-11 18:35 UTC (permalink / raw) To: Jens Axboe, io-uring Cc: Pavel Begunkov, Miklos Szeredi, Joanne Koong, Josef Bacik On 10/11/24 19:57, Jens Axboe wrote: > On 10/10/24 2:56 PM, Bernd Schubert wrote: >> Hello, >> >> as discussed during LPC, we would like to have large CQE sizes, at least >> 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... >> >> Pavel said that this should be ok, but it would be better to have the CQE >> size as function argument. >> Could you give me some hints how this should look like and especially how >> we are going to communicate the CQE size to the kernel? I guess just adding >> IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. > > Not Pavel and unfortunately I could not be at that LPC discussion, but > yeah I don't see why not just adding the necessary SETUP arg for this > would not be the way to go. As long as they are power-of-2, then all > it'll impact on both the kernel and liburing side is what size shift to > use when iterating CQEs. Thanks, Pavel also wanted power-of-2, although 512 is a bit much for fuse. Well, maybe 256 will be sufficient. Going to look into adding that parameter during the next days. > > Since this obviously means larger CQ rings, one nice side effect is that > since 6.10 we don't need contig pages to map any of the rings. So should > work just fine regardless of memory fragmentation, where previously that > would've been a concern. > Out of interest, what is the change? Up to fuse-io-uring rfc2 I was vmalloced buffers for fuse that got mmaped - was working fine. Miklos just wants to avoid that kernel allocates large chunks of memory on behalf of users. Thanks, Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-11 18:35 ` Bernd Schubert @ 2024-10-11 18:39 ` Jens Axboe 2024-10-11 19:03 ` Bernd Schubert 0 siblings, 1 reply; 23+ messages in thread From: Jens Axboe @ 2024-10-11 18:39 UTC (permalink / raw) To: Bernd Schubert, io-uring Cc: Pavel Begunkov, Miklos Szeredi, Joanne Koong, Josef Bacik On 10/11/24 12:35 PM, Bernd Schubert wrote: > On 10/11/24 19:57, Jens Axboe wrote: >> On 10/10/24 2:56 PM, Bernd Schubert wrote: >>> Hello, >>> >>> as discussed during LPC, we would like to have large CQE sizes, at least >>> 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... >>> >>> Pavel said that this should be ok, but it would be better to have the CQE >>> size as function argument. >>> Could you give me some hints how this should look like and especially how >>> we are going to communicate the CQE size to the kernel? I guess just adding >>> IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. >> >> Not Pavel and unfortunately I could not be at that LPC discussion, but >> yeah I don't see why not just adding the necessary SETUP arg for this >> would not be the way to go. As long as they are power-of-2, then all >> it'll impact on both the kernel and liburing side is what size shift to >> use when iterating CQEs. > > Thanks, Pavel also wanted power-of-2, although 512 is a bit much for fuse. > Well, maybe 256 will be sufficient. Going to look into adding that parameter > during the next days. We really have to keep it pow-of-2 just to avoid convoluting the logic (and overhead) of iterating the CQ ring and CQEs. You can search for IORING_SETUP_CQE32 in the kernel to see how it's just a shift, and ditto on the liburing side. Curious, what's all the space needed for? >> Since this obviously means larger CQ rings, one nice side effect is that >> since 6.10 we don't need contig pages to map any of the rings. So should >> work just fine regardless of memory fragmentation, where previously that >> would've been a concern. >> > > Out of interest, what is the change? Up to fuse-io-uring rfc2 I was > vmalloced buffers for fuse that got mmaped - was working fine. Miklos just > wants to avoid that kernel allocates large chunks of memory on behalf of > users. It was the change that got rid of remap_pfn_range() for mapping, and switched to vm_insert_page(s) instead. Memory overhead should generally not be too bad, it's all about sizing the rings appropriately. The much bigger concern is needing contig memory, as that can become scarce after longer uptimes, even with plenty of memory free. This is particularly important if you need 512b CQEs, obviously. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
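To illustrate the "just a shift" point, a sketch of how a power-of-2 CQE size only changes how the CQ ring is indexed; the shift values for sizes beyond CQE32 are an assumption here:

/* with 16B CQEs the shift is 0, with CQE32 it is 1, and a hypothetical
 * 256B CQE would use 4 -- the iteration logic is otherwise unchanged */
static inline struct io_uring_cqe *cqe_at(struct io_uring_cqe *cqes,
                                          unsigned idx, unsigned cqe_shift)
{
        return &cqes[idx << cqe_shift];
}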
* Re: Large CQE for fuse headers 2024-10-11 18:39 ` Jens Axboe @ 2024-10-11 19:03 ` Bernd Schubert 2024-10-11 19:24 ` Jens Axboe 0 siblings, 1 reply; 23+ messages in thread From: Bernd Schubert @ 2024-10-11 19:03 UTC (permalink / raw) To: Jens Axboe, io-uring Cc: Pavel Begunkov, Miklos Szeredi, Joanne Koong, Josef Bacik On 10/11/24 20:39, Jens Axboe wrote: > On 10/11/24 12:35 PM, Bernd Schubert wrote: >> On 10/11/24 19:57, Jens Axboe wrote: >>> On 10/10/24 2:56 PM, Bernd Schubert wrote: >>>> Hello, >>>> >>>> as discussed during LPC, we would like to have large CQE sizes, at least >>>> 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... >>>> >>>> Pavel said that this should be ok, but it would be better to have the CQE >>>> size as function argument. >>>> Could you give me some hints how this should look like and especially how >>>> we are going to communicate the CQE size to the kernel? I guess just adding >>>> IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. >>> >>> Not Pavel and unfortunately I could not be at that LPC discussion, but >>> yeah I don't see why not just adding the necessary SETUP arg for this >>> would not be the way to go. As long as they are power-of-2, then all >>> it'll impact on both the kernel and liburing side is what size shift to >>> use when iterating CQEs. >> >> Thanks, Pavel also wanted power-of-2, although 512 is a bit much for fuse. >> Well, maybe 256 will be sufficient. Going to look into adding that parameter >> during the next days. > > We really have to keep it pow-of-2 just to avoid convoluting the logic > (and overhead) of iterating the CQ ring and CQEs. You can search for > IORING_SETUP_CQE32 in the kernel to see how it's just a shift, and ditto > on the liburing side. Thanks, going to look into it. > > Curious, what's all the space needed for? The basic fuse header: struct fuse_in_header -> current 40B and per request header headers, I think current max is 64. And then some extra compat space for both, so that they can be safely extended in the future (which is currently an issue). > >>> Since this obviously means larger CQ rings, one nice side effect is that >>> since 6.10 we don't need contig pages to map any of the rings. So should >>> work just fine regardless of memory fragmentation, where previously that >>> would've been a concern. >>> >> >> Out of interest, what is the change? Up to fuse-io-uring rfc2 I was >> vmalloced buffers for fuse that got mmaped - was working fine. Miklos just >> wants to avoid that kernel allocates large chunks of memory on behalf of >> users. > > It was the change that got rid of remap_pfn_range() for mapping, and > switched to vm_insert_page(s) instead. Memory overhead should generally > not be too bad, it's all about sizing the rings appropriately. The much > bigger concern is needing contig memory, as that can become scarce after > longer uptimes, even with plenty of memory free. This is particularly > important if you need 512b CQEs, obviously. > For sure, I was just curious what you had changed. I think I had looked into that io-uring code around 2 years ago. Going to look into the update io-uring code, thanks for the hint. For fuse I was just using remap_vmalloc_range(). https://lore.kernel.org/all/[email protected]/ Thanks, Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
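A rough sketch of the space estimate above; this is not a proposed uapi, and the struct name, union members and reserved size are only illustrative:

#include <linux/fuse.h>

struct fuse_uring_cqe_payload {         /* hypothetical */
        struct fuse_in_header in;       /* currently 40 bytes */
        union {                         /* per-opcode header, current max ~64 bytes */
                struct fuse_read_in read;
                struct fuse_write_in write;
                char raw[64];
        } op;
        char reserved[24];              /* extra compat space so both parts can
                                         * grow; this is what pushes the total
                                         * toward a 256B CQE */
};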
* Re: Large CQE for fuse headers 2024-10-11 19:03 ` Bernd Schubert @ 2024-10-11 19:24 ` Jens Axboe 0 siblings, 0 replies; 23+ messages in thread From: Jens Axboe @ 2024-10-11 19:24 UTC (permalink / raw) To: Bernd Schubert, io-uring Cc: Pavel Begunkov, Miklos Szeredi, Joanne Koong, Josef Bacik On 10/11/24 1:03 PM, Bernd Schubert wrote: >> >> Curious, what's all the space needed for? > > The basic fuse header: struct fuse_in_header -> current 40B > and per request header headers, I think current max is 64. > > And then some extra compat space for both, so that they can be safely > extended in the future (which is currently an issue). So that's 104b, and regular CQE stuff too I presume, so that's 104+16 == 120 bytes. That'd fit in a 128b CQE, and 256b would be pleeeeenty? Just squeeze a version field or something in there so you know what the version is for future proofing? I would strongly recommend making it as large as you need it for those things, but no longer just for compat/future reasons. Eg 128b over 256b is a win for sure, and 256b over 512b is a REALLY nice win. >>>> Since this obviously means larger CQ rings, one nice side effect is that >>>> since 6.10 we don't need contig pages to map any of the rings. So should >>>> work just fine regardless of memory fragmentation, where previously that >>>> would've been a concern. >>>> >>> >>> Out of interest, what is the change? Up to fuse-io-uring rfc2 I was >>> vmalloced buffers for fuse that got mmaped - was working fine. Miklos just >>> wants to avoid that kernel allocates large chunks of memory on behalf of >>> users. >> >> It was the change that got rid of remap_pfn_range() for mapping, and >> switched to vm_insert_page(s) instead. Memory overhead should generally >> not be too bad, it's all about sizing the rings appropriately. The much >> bigger concern is needing contig memory, as that can become scarce after >> longer uptimes, even with plenty of memory free. This is particularly >> important if you need 512b CQEs, obviously. >> > > For sure, I was just curious what you had changed. I think I had looked into > that io-uring code around 2 years ago. Going to look into the update > io-uring code, thanks for the hint. > For fuse I was just using remap_vmalloc_range(). > > https://lore.kernel.org/all/[email protected]/ That's the one to use, io_uring was just stuck with using the wrong API for quite a while, but that got sorted. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
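One possible shape for the suggested version field, purely as a sketch (the struct and field names are made up):

#include <stdint.h>

struct fuse_uring_cqe_hdr {             /* hypothetical */
        uint16_t version;               /* bumped when the layout grows */
        uint16_t payload_len;           /* bytes the kernel actually filled in */
        /* struct fuse_in_header plus the per-opcode header would follow */
};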
* Re: Large CQE for fuse headers 2024-10-10 20:56 Large CQE for fuse headers Bernd Schubert 2024-10-11 17:57 ` Jens Axboe @ 2024-10-11 21:38 ` Pavel Begunkov 2024-10-12 1:55 ` Ming Lei 2 siblings, 0 replies; 23+ messages in thread From: Pavel Begunkov @ 2024-10-11 21:38 UTC (permalink / raw) To: Bernd Schubert, io-uring Cc: Jens Axboe, Miklos Szeredi, Joanne Koong, Josef Bacik On 10/10/24 21:56, Bernd Schubert wrote: > Hello, > > as discussed during LPC, we would like to have large CQE sizes, at least > 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... > > Pavel said that this should be ok, but it would be better to have the CQE > size as function argument. > Could you give me some hints how this should look like and especially how I remembered it as SQEs, which would've been much easier. In the current io_uring infra, usually to post a cqe an opcode specific path stashed the result value in the request structure and then the core io_uring code will post it for you. We won't find space for 256B, however, and it'd need to happen right from the cmd path and follow the rules when / from what context it can be posted. I'll take a stub to see how it can look like. > we are going to communicate the CQE size to the kernel? I guess just adding > IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. 3-4 special cases is already odd as an API, we should rather just pass the desired CQE size. > I'm basically through with other changes Miklos had been asking for and > moving fuse headers into the CQE is next. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 23+ messages in thread
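For reference, a uring_cmd provider today completes through the stashed-result path described above; with CQE32 the extra value ends up in the big CQE, and a 256B payload would need a wider variant of this hook. A sketch of the existing kernel-side call only (the wrapper function is hypothetical):

#include <linux/io_uring/cmd.h>

/* sketch: 'err' becomes cqe->res, and with IORING_SETUP_CQE32 'res2' is
 * written into the extra 16 bytes of the CQE */
static void fuse_uring_cmd_complete(struct io_uring_cmd *ioucmd, int err,
                                    u64 res2, unsigned int issue_flags)
{
        io_uring_cmd_done(ioucmd, err, res2, issue_flags);
}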
* Re: Large CQE for fuse headers 2024-10-10 20:56 Large CQE for fuse headers Bernd Schubert 2024-10-11 17:57 ` Jens Axboe 2024-10-11 21:38 ` Pavel Begunkov @ 2024-10-12 1:55 ` Ming Lei 2024-10-12 14:38 ` Jens Axboe 2 siblings, 1 reply; 23+ messages in thread From: Ming Lei @ 2024-10-12 1:55 UTC (permalink / raw) To: Bernd Schubert Cc: io-uring, Pavel Begunkov, Jens Axboe, Miklos Szeredi, Joanne Koong, Josef Bacik On Fri, Oct 11, 2024 at 4:56 AM Bernd Schubert <[email protected]> wrote: > > Hello, > > as discussed during LPC, we would like to have large CQE sizes, at least > 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... > > Pavel said that this should be ok, but it would be better to have the CQE > size as function argument. > Could you give me some hints how this should look like and especially how > we are going to communicate the CQE size to the kernel? I guess just adding > IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. > > I'm basically through with other changes Miklos had been asking for and > moving fuse headers into the CQE is next. Big CQE may not be efficient, there are copy from kernel to CQE and from CQE to userspace. And not flexible, it is one ring-wide property, if it is big, any CQE from this ring has to be big. If you are saying uring_cmd, another way is to mapped one area for this purpose, the fuse driver can write fuse headers to this indexed mmap buffer, and userspace read it, which is just efficient, without io_uring core changes. ublk uses this way to fill IO request header. But it requires each command to have a unique tag. thanks, Ming Lei ^ permalink raw reply [flat|nested] 23+ messages in thread
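A sketch of the ublk-style alternative being described; the slot size, mmap offset and device fd are all assumptions for illustration:

#include <stddef.h>
#include <sys/mman.h>
#include <linux/fuse.h>

#define HDR_SLOT_SIZE   128             /* hypothetical per-request header slot */
#define HDR_AREA_OFF    0x10000000UL    /* hypothetical mmap offset of the area */

/* userspace maps one header slot per possible tag; the kernel writes the
 * fuse headers for request 'tag' into its slot instead of into a big CQE */
static void *map_hdr_area(int dev_fd, unsigned nr_slots)
{
        return mmap(NULL, (size_t)nr_slots * HDR_SLOT_SIZE, PROT_READ,
                    MAP_SHARED, dev_fd, HDR_AREA_OFF);
}

static struct fuse_in_header *hdr_for_tag(void *hdr_area, unsigned tag)
{
        return (struct fuse_in_header *)((char *)hdr_area + tag * HDR_SLOT_SIZE);
}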
* Re: Large CQE for fuse headers 2024-10-12 1:55 ` Ming Lei @ 2024-10-12 14:38 ` Jens Axboe 2024-10-13 21:20 ` Bernd Schubert 0 siblings, 1 reply; 23+ messages in thread From: Jens Axboe @ 2024-10-12 14:38 UTC (permalink / raw) To: Ming Lei, Bernd Schubert Cc: io-uring, Pavel Begunkov, Miklos Szeredi, Joanne Koong, Josef Bacik On 10/11/24 7:55 PM, Ming Lei wrote: > On Fri, Oct 11, 2024 at 4:56?AM Bernd Schubert > <[email protected]> wrote: >> >> Hello, >> >> as discussed during LPC, we would like to have large CQE sizes, at least >> 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... >> >> Pavel said that this should be ok, but it would be better to have the CQE >> size as function argument. >> Could you give me some hints how this should look like and especially how >> we are going to communicate the CQE size to the kernel? I guess just adding >> IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. >> >> I'm basically through with other changes Miklos had been asking for and >> moving fuse headers into the CQE is next. > > Big CQE may not be efficient, there are copy from kernel to CQE and > from CQE to userspace. And not flexible, it is one ring-wide property, > if it is big, > any CQE from this ring has to be big. There isn't really a copy - the kernel fills it in, generally the application itself, just in the kernel, and then the application can read it on that side. It's the same memory, and it'll also generally be cache hot when the applicatio reaps it. Unless a lot of time has passed, obviously. That said, yeah bigger sqe/cqe is less ideal than smaller ones, obviously. Currently you can fit 4 normal cqes in a cache line, or a single sqe. Making either of them bigger will obviously bloat that. > If you are saying uring_cmd, another way is to mapped one area for > this purpose, the fuse driver can write fuse headers to this indexed > mmap buffer, and userspace read it, which is just efficient, without > io_uring core changes. ublk uses this way to fill IO request header. > But it requires each command to have a unique tag. That may indeed be a decent idea for this too. You don't even need fancy tagging, you can just use the cqe index for your tag too, as it should not be bigger than the the cq ring space. Then you can get away with just using normal cqe sizes, and just have a shared region between the two where data gets written by the uring_cmd completion, and the app can access it directly from userspace. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
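The consuming side of that idea could then stay on normal-sized CQEs, e.g. as in this sketch, reusing the hypothetical hdr_for_tag() slot helper sketched above, with the tag carried in user_data; handle_fuse_request() is a stand-in for the fuse dispatch:

#include <liburing.h>

static void reap_fuse_cqes(struct io_uring *ring, void *hdr_area)
{
        struct io_uring_cqe *cqe;
        unsigned head, seen = 0;

        io_uring_for_each_cqe(ring, head, cqe) {
                unsigned tag = (unsigned)cqe->user_data;  /* tag == slot index */

                handle_fuse_request(hdr_for_tag(hdr_area, tag), cqe->res);
                seen++;
        }
        io_uring_cq_advance(ring, seen);
}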
* Re: Large CQE for fuse headers 2024-10-12 14:38 ` Jens Axboe @ 2024-10-13 21:20 ` Bernd Schubert 2024-10-14 2:44 ` Ming Lei 2024-10-14 10:31 ` Miklos Szeredi 0 siblings, 2 replies; 23+ messages in thread From: Bernd Schubert @ 2024-10-13 21:20 UTC (permalink / raw) To: Jens Axboe, Ming Lei Cc: io-uring, Pavel Begunkov, Miklos Szeredi, Joanne Koong, Josef Bacik On 10/12/24 16:38, Jens Axboe wrote: > On 10/11/24 7:55 PM, Ming Lei wrote: >> On Fri, Oct 11, 2024 at 4:56?AM Bernd Schubert >> <[email protected]> wrote: >>> >>> Hello, >>> >>> as discussed during LPC, we would like to have large CQE sizes, at least >>> 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... >>> >>> Pavel said that this should be ok, but it would be better to have the CQE >>> size as function argument. >>> Could you give me some hints how this should look like and especially how >>> we are going to communicate the CQE size to the kernel? I guess just adding >>> IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. >>> >>> I'm basically through with other changes Miklos had been asking for and >>> moving fuse headers into the CQE is next. >> >> Big CQE may not be efficient, there are copy from kernel to CQE and >> from CQE to userspace. And not flexible, it is one ring-wide property, >> if it is big, >> any CQE from this ring has to be big. > > There isn't really a copy - the kernel fills it in, generally the > application itself, just in the kernel, and then the application can > read it on that side. It's the same memory, and it'll also generally be > cache hot when the applicatio reaps it. Unless a lot of time has passed, > obviously. > > That said, yeah bigger sqe/cqe is less ideal than smaller ones, > obviously. Currently you can fit 4 normal cqes in a cache line, or a > single sqe. Making either of them bigger will obviously bloat that. > >> If you are saying uring_cmd, another way is to mapped one area for >> this purpose, the fuse driver can write fuse headers to this indexed >> mmap buffer, and userspace read it, which is just efficient, without >> io_uring core changes. ublk uses this way to fill IO request header. >> But it requires each command to have a unique tag. > > That may indeed be a decent idea for this too. You don't even need fancy > tagging, you can just use the cqe index for your tag too, as it should > not be bigger than the the cq ring space. Then you can get away with > just using normal cqe sizes, and just have a shared region between the > two where data gets written by the uring_cmd completion, and the app can > access it directly from userspace. Would be good if Miklos could chime in here, adding back mmap for headers wouldn't be difficult, but would add back more fuse-uring startup and tear-down code. From performance point of view, I don't know anything about CPU cache prefetching, but shouldn't the cpu cache logic be able to easily prefetch larger linear io-uring rings into 2nd/3rd level caches? And if if the fuse header is in a separated buffer, it can't auto prefetch that without additional instructions? I.e. how would the cpu cache logic auto know about these additional memory areas? Thanks, Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-13 21:20 ` Bernd Schubert @ 2024-10-14 2:44 ` Ming Lei 2024-10-14 11:10 ` Miklos Szeredi 2024-10-14 10:31 ` Miklos Szeredi 1 sibling, 1 reply; 23+ messages in thread From: Ming Lei @ 2024-10-14 2:44 UTC (permalink / raw) To: Bernd Schubert Cc: Jens Axboe, io-uring, Pavel Begunkov, Miklos Szeredi, Joanne Koong, Josef Bacik On Sun, Oct 13, 2024 at 11:20:53PM +0200, Bernd Schubert wrote: > > > On 10/12/24 16:38, Jens Axboe wrote: > > On 10/11/24 7:55 PM, Ming Lei wrote: > >> On Fri, Oct 11, 2024 at 4:56?AM Bernd Schubert > >> <[email protected]> wrote: > >>> > >>> Hello, > >>> > >>> as discussed during LPC, we would like to have large CQE sizes, at least > >>> 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... > >>> > >>> Pavel said that this should be ok, but it would be better to have the CQE > >>> size as function argument. > >>> Could you give me some hints how this should look like and especially how > >>> we are going to communicate the CQE size to the kernel? I guess just adding > >>> IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. > >>> > >>> I'm basically through with other changes Miklos had been asking for and > >>> moving fuse headers into the CQE is next. > >> > >> Big CQE may not be efficient, there are copy from kernel to CQE and > >> from CQE to userspace. And not flexible, it is one ring-wide property, > >> if it is big, > >> any CQE from this ring has to be big. > > > > There isn't really a copy - the kernel fills it in, generally the > > application itself, just in the kernel, and then the application can > > read it on that side. It's the same memory, and it'll also generally be > > cache hot when the applicatio reaps it. Unless a lot of time has passed, > > obviously. > > > > That said, yeah bigger sqe/cqe is less ideal than smaller ones, > > obviously. Currently you can fit 4 normal cqes in a cache line, or a > > single sqe. Making either of them bigger will obviously bloat that. > > > >> If you are saying uring_cmd, another way is to mapped one area for > >> this purpose, the fuse driver can write fuse headers to this indexed > >> mmap buffer, and userspace read it, which is just efficient, without > >> io_uring core changes. ublk uses this way to fill IO request header. > >> But it requires each command to have a unique tag. > > > > That may indeed be a decent idea for this too. You don't even need fancy > > tagging, you can just use the cqe index for your tag too, as it should > > not be bigger than the the cq ring space. Then you can get away with > > just using normal cqe sizes, and just have a shared region between the > > two where data gets written by the uring_cmd completion, and the app can > > access it directly from userspace. > > Would be good if Miklos could chime in here, adding back mmap for headers > wouldn't be difficult, but would add back more fuse-uring startup and > tear-down code. > > From performance point of view, I don't know anything about CPU cache > prefetching, but shouldn't the cpu cache logic be able to easily prefetch > larger linear io-uring rings into 2nd/3rd level caches? And if if the > fuse header is in a separated buffer, it can't auto prefetch that > without additional instructions? I.e. how would the cpu cache logic > auto know about these additional memory areas? 
It also depends on how fuse user code consumes the big CQE payload, if fuse header needs to keep in memory a bit long, you may have to copy it somewhere for post-processing since io_uring(kernel) needs CQE to be returned back asap. Thanks, Ming ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-14 2:44 ` Ming Lei @ 2024-10-14 11:10 ` Miklos Szeredi 2024-10-14 12:47 ` Bernd Schubert 2024-10-14 13:20 ` Bernd Schubert 0 siblings, 2 replies; 23+ messages in thread From: Miklos Szeredi @ 2024-10-14 11:10 UTC (permalink / raw) To: Ming Lei Cc: Bernd Schubert, Jens Axboe, io-uring, Pavel Begunkov, Joanne Koong, Josef Bacik On Mon, 14 Oct 2024 at 04:44, Ming Lei <[email protected]> wrote: > It also depends on how fuse user code consumes the big CQE payload, if > fuse header needs to keep in memory a bit long, you may have to copy it > somewhere for post-processing since io_uring(kernel) needs CQE to be > returned back asap. Yes. I'm not quite sure how the libfuse interface will work to accommodate this. Currently if the server needs to delay the processing of a request it would have to copy all arguments, since validity will not be guaranteed after the callback returns. With the io_uring infrastructure the headers would need to be copied, but the data buffer would be per-request and would not need copying. This is relaxing a requirement so existing servers would continue to work fine, but would not be able to take full advantage of the multi-buffer design. Bernd do you have an idea how this would work? Thanks, Miklos ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-14 11:10 ` Miklos Szeredi @ 2024-10-14 12:47 ` Bernd Schubert 2024-10-14 13:34 ` Pavel Begunkov 2024-10-14 13:20 ` Bernd Schubert 1 sibling, 1 reply; 23+ messages in thread From: Bernd Schubert @ 2024-10-14 12:47 UTC (permalink / raw) To: Miklos Szeredi, Ming Lei Cc: Jens Axboe, io-uring, Pavel Begunkov, Joanne Koong, Josef Bacik On 10/14/24 13:10, Miklos Szeredi wrote: > On Mon, 14 Oct 2024 at 04:44, Ming Lei <[email protected]> wrote: > >> It also depends on how fuse user code consumes the big CQE payload, if >> fuse header needs to keep in memory a bit long, you may have to copy it >> somewhere for post-processing since io_uring(kernel) needs CQE to be >> returned back asap. > > Yes. > > I'm not quite sure how the libfuse interface will work to accommodate > this. Currently if the server needs to delay the processing of a > request it would have to copy all arguments, since validity will not > be guaranteed after the callback returns. With the io_uring > infrastructure the headers would need to be copied, but the data > buffer would be per-request and would not need copying. This is > relaxing a requirement so existing servers would continue to work > fine, but would not be able to take full advantage of the multi-buffer > design. > > Bernd do you have an idea how this would work? I assume returning a CQE is io_uring_cq_advance()? In my current libfuse io_uring branch that only happens when all CQEs have been processed. We could also easily switch to io_uring_cqe_seen() to do it per CQE. I don't understand why we need to return CQEs asap, assuming CQ ring size is the same as SQ ring size - why does it matter? If we indeed need to return the CQE before processing the request, it indeed would be better to have a 2nd memory buffer associated with the fuse request. Thanks, Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
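For reference, the two liburing idioms being compared, batch advance versus returning each CQE as soon as it has been handled (process() stands in for the fuse dispatch):

static void reap_batched(struct io_uring *ring)
{
        struct io_uring_cqe *cqe;
        unsigned head, seen = 0;

        io_uring_for_each_cqe(ring, head, cqe) {
                process(cqe);
                seen++;
        }
        io_uring_cq_advance(ring, seen);        /* hand everything back at once */
}

static void reap_one(struct io_uring *ring)
{
        struct io_uring_cqe *cqe;

        if (!io_uring_wait_cqe(ring, &cqe)) {
                process(cqe);
                io_uring_cqe_seen(ring, cqe);   /* return this CQE immediately */
        }
}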
* Re: Large CQE for fuse headers 2024-10-14 12:47 ` Bernd Schubert @ 2024-10-14 13:34 ` Pavel Begunkov 2024-10-14 15:21 ` Bernd Schubert 0 siblings, 1 reply; 23+ messages in thread From: Pavel Begunkov @ 2024-10-14 13:34 UTC (permalink / raw) To: Bernd Schubert, Miklos Szeredi, Ming Lei Cc: Jens Axboe, io-uring, Joanne Koong, Josef Bacik On 10/14/24 13:47, Bernd Schubert wrote: > On 10/14/24 13:10, Miklos Szeredi wrote: >> On Mon, 14 Oct 2024 at 04:44, Ming Lei <[email protected]> wrote: >> >>> It also depends on how fuse user code consumes the big CQE payload, if >>> fuse header needs to keep in memory a bit long, you may have to copy it >>> somewhere for post-processing since io_uring(kernel) needs CQE to be >>> returned back asap. >> >> Yes. >> >> I'm not quite sure how the libfuse interface will work to accommodate >> this. Currently if the server needs to delay the processing of a >> request it would have to copy all arguments, since validity will not >> be guaranteed after the callback returns. With the io_uring >> infrastructure the headers would need to be copied, but the data >> buffer would be per-request and would not need copying. This is >> relaxing a requirement so existing servers would continue to work >> fine, but would not be able to take full advantage of the multi-buffer >> design. >> >> Bernd do you have an idea how this would work? > > I assume returning a CQE is io_uring_cq_advance()? Yes > In my current libfuse io_uring branch that only happens when > all CQEs have been processed. We could also easily switch to > io_uring_cqe_seen() to do it per CQE. Either that one. > I don't understand why we need to return CQEs asap, assuming CQ > ring size is the same as SQ ring size - why does it matter? The SQE is consumed once the request is issued, but nothing prevents the user to keep the QD larger than the SQ size, e.g. do M syscalls each ending N requests and then wait for N * M completions. > If we indeed need to return the CQE before processing the request, > it indeed would be better to have a 2nd memory buffer associated with > the fuse request. With that said, the usual problem is to size the CQ so that it (almost) never overflows, otherwise it hurts performance. With DEFER_TASKRUN you can delay returning CQEs to the kernel until the next time you wait for completions, i.e. do io_uring waiting syscall. Without the flag, CQEs may come asynchronously to the user, so need a bit more consideration. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 23+ messages in thread
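The DEFER_TASKRUN variant mentioned at the end is a ring-creation flag and requires SINGLE_ISSUER; a minimal sketch:

static int setup_ring_defer(struct io_uring *ring, unsigned entries)
{
        struct io_uring_params p = {
                .flags = IORING_SETUP_SINGLE_ISSUER |
                         IORING_SETUP_DEFER_TASKRUN,
        };

        /* completions are then only posted when this task enters the waiting
         * syscall, which makes CQ sizing and overflow easier to reason about */
        return io_uring_queue_init_params(entries, ring, &p);
}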
* Re: Large CQE for fuse headers 2024-10-14 13:34 ` Pavel Begunkov @ 2024-10-14 15:21 ` Bernd Schubert 2024-10-14 17:48 ` Pavel Begunkov 0 siblings, 1 reply; 23+ messages in thread From: Bernd Schubert @ 2024-10-14 15:21 UTC (permalink / raw) To: Pavel Begunkov, Miklos Szeredi, Ming Lei Cc: Jens Axboe, io-uring, Joanne Koong, Josef Bacik On 10/14/24 15:34, Pavel Begunkov wrote: > On 10/14/24 13:47, Bernd Schubert wrote: >> On 10/14/24 13:10, Miklos Szeredi wrote: >>> On Mon, 14 Oct 2024 at 04:44, Ming Lei <[email protected]> wrote: >>> >>>> It also depends on how fuse user code consumes the big CQE payload, if >>>> fuse header needs to keep in memory a bit long, you may have to copy it >>>> somewhere for post-processing since io_uring(kernel) needs CQE to be >>>> returned back asap. >>> >>> Yes. >>> >>> I'm not quite sure how the libfuse interface will work to accommodate >>> this. Currently if the server needs to delay the processing of a >>> request it would have to copy all arguments, since validity will not >>> be guaranteed after the callback returns. With the io_uring >>> infrastructure the headers would need to be copied, but the data >>> buffer would be per-request and would not need copying. This is >>> relaxing a requirement so existing servers would continue to work >>> fine, but would not be able to take full advantage of the multi-buffer >>> design. >>> >>> Bernd do you have an idea how this would work? >> >> I assume returning a CQE is io_uring_cq_advance()? > > Yes > >> In my current libfuse io_uring branch that only happens when >> all CQEs have been processed. We could also easily switch to >> io_uring_cqe_seen() to do it per CQE. > > Either that one. > >> I don't understand why we need to return CQEs asap, assuming CQ >> ring size is the same as SQ ring size - why does it matter? > > The SQE is consumed once the request is issued, but nothing > prevents the user to keep the QD larger than the SQ size, > e.g. do M syscalls each ending N requests and then wait for > N * M completions. > I need a bit help to understand this. Do you mean that in typical io-uring usage SQEs get submitted, already released in kernel and then users submit even more SQEs? And that creates a kernel queue depth for completion? I guess as long as libfuse does not expose the ring we don't have that issue. But then yeah, exposing the ring to fuse-server/daemon is planned... >> If we indeed need to return the CQE before processing the request, >> it indeed would be better to have a 2nd memory buffer associated with >> the fuse request. > > With that said, the usual problem is to size the CQ so that it > (almost) never overflows, otherwise it hurts performance. With > DEFER_TASKRUN you can delay returning CQEs to the kernel until > the next time you wait for completions, i.e. do io_uring waiting > syscall. Without the flag, CQEs may come asynchronously to the > user, so need a bit more consideration. > Current libfuse code has it disabled IORING_SETUP_SINGLE_ISSUER, IORING_SETUP_DEFER_TASKRUN, IORING_SETUP_TASKRUN_FLAG and IORING_SETUP_COOP_TASKRUN as these are somehow slowing down things. Not sure if this thread is optimal to discuss this. I would also first like to sort out all the other design topics before going into fine-tuning... Thanks, Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-14 15:21 ` Bernd Schubert @ 2024-10-14 17:48 ` Pavel Begunkov 2024-10-14 21:27 ` Bernd Schubert 0 siblings, 1 reply; 23+ messages in thread From: Pavel Begunkov @ 2024-10-14 17:48 UTC (permalink / raw) To: Bernd Schubert, Miklos Szeredi, Ming Lei Cc: Jens Axboe, io-uring, Joanne Koong, Josef Bacik On 10/14/24 16:21, Bernd Schubert wrote: > On 10/14/24 15:34, Pavel Begunkov wrote: >> On 10/14/24 13:47, Bernd Schubert wrote: >>> On 10/14/24 13:10, Miklos Szeredi wrote: >>>> On Mon, 14 Oct 2024 at 04:44, Ming Lei <[email protected]> wrote: >>>> >>>>> It also depends on how fuse user code consumes the big CQE payload, if >>>>> fuse header needs to keep in memory a bit long, you may have to copy it >>>>> somewhere for post-processing since io_uring(kernel) needs CQE to be >>>>> returned back asap. >>>> >>>> Yes. >>>> >>>> I'm not quite sure how the libfuse interface will work to accommodate >>>> this. Currently if the server needs to delay the processing of a >>>> request it would have to copy all arguments, since validity will not >>>> be guaranteed after the callback returns. With the io_uring >>>> infrastructure the headers would need to be copied, but the data >>>> buffer would be per-request and would not need copying. This is >>>> relaxing a requirement so existing servers would continue to work >>>> fine, but would not be able to take full advantage of the multi-buffer >>>> design. >>>> >>>> Bernd do you have an idea how this would work? >>> >>> I assume returning a CQE is io_uring_cq_advance()? >> >> Yes >> >>> In my current libfuse io_uring branch that only happens when >>> all CQEs have been processed. We could also easily switch to >>> io_uring_cqe_seen() to do it per CQE. >> >> Either that one. >> >>> I don't understand why we need to return CQEs asap, assuming CQ >>> ring size is the same as SQ ring size - why does it matter? >> >> The SQE is consumed once the request is issued, but nothing >> prevents the user to keep the QD larger than the SQ size, >> e.g. do M syscalls each ending N requests and then wait for typo, Sending or queueing N requests. In other words it's perfectly legal to: It's perfectly legal to: ring = create_ring(nr_cqes=N); for (i = 0 .. M) { for (i = 0..N) prep_sqe(); submit_all_sqes(); } wait(nr=N * M); With a caveat that the wait can't complete more than the CQ size, but you can even add a loop atop of the wait. while (nr_inflight_cqes) { wait(nr = min(CQ_size, nr_inflight_cqes); process_cqes(); } Or do something more elaborate, often frameworks allow to push any number of requests not caring too much about exactly matching queue sizes apart from sizing them for performance reasons. >> N * M completions. >> > > I need a bit help to understand this. Do you mean that in typical > io-uring usage SQEs get submitted, already released in kernel Typical or not, but the number of requests in flight is not limited by the size of the SQ, it only limits how many requests you can queue per syscall, i.e. per io_uring_submit(). > and then users submit even more SQEs? And that creates a > kernel queue depth for completion? > I guess as long as libfuse does not expose the ring we don't have > that issue. But then yeah, exposing the ring to fuse-server/daemon > is planned... Could be, for example you don't need to care about overflows at all if the CQ size is always larger than the number of requests in flight. 
Perhaps the simplest example: prep_requests(nr=N); wait_cq(nr=N); process_cqes(nr=N); >>> If we indeed need to return the CQE before processing the request, >>> it indeed would be better to have a 2nd memory buffer associated with >>> the fuse request. >> >> With that said, the usual problem is to size the CQ so that it >> (almost) never overflows, otherwise it hurts performance. With >> DEFER_TASKRUN you can delay returning CQEs to the kernel until >> the next time you wait for completions, i.e. do io_uring waiting >> syscall. Without the flag, CQEs may come asynchronously to the >> user, so need a bit more consideration. >> > > Current libfuse code has it disabled IORING_SETUP_SINGLE_ISSUER, > IORING_SETUP_DEFER_TASKRUN, IORING_SETUP_TASKRUN_FLAG and > IORING_SETUP_COOP_TASKRUN as these are somehow slowing down > things. Those flags are not a requirement, you can try to size the CQ so that overflows are rare, it's just a bit easier to do with DEFER_TASKRUN. > Not sure if this thread is optimal to discuss this. I would > also first like to sort out all the other design topics before > going into fine-tuning... -- Pavel Begunkov ^ permalink raw reply [flat|nested] 23+ messages in thread
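Spelling the "simplest example" above out with liburing calls might look like this; prep_fetch() and handle() are hypothetical stand-ins for preparing the fuse uring_cmd and dispatching its completion:

static void one_round(struct io_uring *ring, unsigned n)
{
        struct io_uring_cqe *cqe;
        unsigned head, done = 0;

        for (unsigned i = 0; i < n; i++) {
                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

                prep_fetch(sqe, i);                     /* prep_requests(nr=N) */
                sqe->user_data = i;
        }
        io_uring_submit_and_wait(ring, n);              /* wait_cq(nr=N) */

        io_uring_for_each_cqe(ring, head, cqe) {        /* process_cqes(nr=N) */
                handle(cqe);
                done++;
        }
        io_uring_cq_advance(ring, done);
}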
* Re: Large CQE for fuse headers 2024-10-14 17:48 ` Pavel Begunkov @ 2024-10-14 21:27 ` Bernd Schubert 2024-10-16 10:54 ` Miklos Szeredi 0 siblings, 1 reply; 23+ messages in thread From: Bernd Schubert @ 2024-10-14 21:27 UTC (permalink / raw) To: Pavel Begunkov, Miklos Szeredi, Ming Lei Cc: Jens Axboe, io-uring, Joanne Koong, Josef Bacik On 10/14/24 19:48, Pavel Begunkov wrote: > On 10/14/24 16:21, Bernd Schubert wrote: >> On 10/14/24 15:34, Pavel Begunkov wrote: >>> On 10/14/24 13:47, Bernd Schubert wrote: >>>> On 10/14/24 13:10, Miklos Szeredi wrote: >>>>> On Mon, 14 Oct 2024 at 04:44, Ming Lei <[email protected]> wrote: >>>>> >>>>>> It also depends on how fuse user code consumes the big CQE >>>>>> payload, if >>>>>> fuse header needs to keep in memory a bit long, you may have to >>>>>> copy it >>>>>> somewhere for post-processing since io_uring(kernel) needs CQE to be >>>>>> returned back asap. >>>>> >>>>> Yes. >>>>> >>>>> I'm not quite sure how the libfuse interface will work to accommodate >>>>> this. Currently if the server needs to delay the processing of a >>>>> request it would have to copy all arguments, since validity will not >>>>> be guaranteed after the callback returns. With the io_uring >>>>> infrastructure the headers would need to be copied, but the data >>>>> buffer would be per-request and would not need copying. This is >>>>> relaxing a requirement so existing servers would continue to work >>>>> fine, but would not be able to take full advantage of the multi-buffer >>>>> design. >>>>> >>>>> Bernd do you have an idea how this would work? >>>> >>>> I assume returning a CQE is io_uring_cq_advance()? >>> >>> Yes >>> >>>> In my current libfuse io_uring branch that only happens when >>>> all CQEs have been processed. We could also easily switch to >>>> io_uring_cqe_seen() to do it per CQE. >>> >>> Either that one. >>> >>>> I don't understand why we need to return CQEs asap, assuming CQ >>>> ring size is the same as SQ ring size - why does it matter? >>> >>> The SQE is consumed once the request is issued, but nothing >>> prevents the user to keep the QD larger than the SQ size, >>> e.g. do M syscalls each ending N requests and then wait for > > typo, Sending or queueing N requests. In other words it's > perfectly legal to: > > It's perfectly legal to: > > ring = create_ring(nr_cqes=N); > for (i = 0 .. M) { > for (i = 0..N) > prep_sqe(); > submit_all_sqes(); > } > wait(nr=N * M); > > > With a caveat that the wait can't complete more than the > CQ size, but you can even add a loop atop of the wait. > > while (nr_inflight_cqes) { > wait(nr = min(CQ_size, nr_inflight_cqes); > process_cqes(); > } > > Or do something more elaborate, often frameworks allow > to push any number of requests not caring too much about > exactly matching queue sizes apart from sizing them for > performance reasons. > >>> N * M completions. >>> >> >> I need a bit help to understand this. Do you mean that in typical >> io-uring usage SQEs get submitted, already released in kernel > > Typical or not, but the number of requests in flight is not > limited by the size of the SQ, it only limits how many > requests you can queue per syscall, i.e. per io_uring_submit(). > > >> and then users submit even more SQEs? And that creates a >> kernel queue depth for completion? >> I guess as long as libfuse does not expose the ring we don't have >> that issue. But then yeah, exposing the ring to fuse-server/daemon >> is planned... 
> > Could be, for example you don't need to care about overflows > at all if the CQ size is always larger than the number of > requests in flight. Perhaps the simplest example: > > prep_requests(nr=N); > wait_cq(nr=N); > process_cqes(nr=N); With only libfuse as ring user it is more like prep_requests(nr=N); wait_cq(1); ==> we must not wait for more than 1 as more might never arrive io_uring_for_each_cqe { } I still think no issue with libfuse (or any other fuse lib) as single ring-user, but if the same ring then gets used for more all of that might come up. @Miklos maybe we avoid using large CQEs/SQEs and instead set up our own separate buffer for FUSE headers? Thanks, Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-14 21:27 ` Bernd Schubert @ 2024-10-16 10:54 ` Miklos Szeredi 2024-10-16 11:53 ` Bernd Schubert 0 siblings, 1 reply; 23+ messages in thread From: Miklos Szeredi @ 2024-10-16 10:54 UTC (permalink / raw) To: Bernd Schubert Cc: Pavel Begunkov, Ming Lei, Jens Axboe, io-uring, Joanne Koong, Josef Bacik On Mon, 14 Oct 2024 at 23:27, Bernd Schubert <[email protected]> wrote: > With only libfuse as ring user it is more like > > prep_requests(nr=N); > wait_cq(1); ==> we must not wait for more than 1 as more might never arrive > io_uring_for_each_cqe { > } Right. I think the point Pavel is trying to make is that io_uring queue sizes don't have to match fuse queue size. So we could have sq_entries=4, cq_entries=4 and have the server queue 64 FUSE_URING_REQ_FETCH commands, it just has to do that in batches of 4 max. > @Miklos maybe we avoid using large CQEs/SQEs and instead set up our own > separate buffer for FUSE headers? The only gain from this would be in the case where the uring is used for non-fuse requests as well, in which case the extra space in the queue entries would be unused (i.e. 48 unused bytes in the cacheline). I don't know if this is a realistic use case or not. It's definitely a challenge to create a library API that allows this. The disadvantage would be a more complex interface. Thanks, Miklos ^ permalink raw reply [flat|nested] 23+ messages in thread
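The sq_entries=4 example in concrete terms, as a sketch (prep_fetch_cmd() is a hypothetical helper that prepares one FUSE_URING_REQ_FETCH uring_cmd):

/* a 4-entry SQ can still keep 64 fetch commands in flight; it just has to
 * submit them in batches of at most 4 */
static void queue_fetches(struct io_uring *ring, unsigned total, unsigned batch)
{
        for (unsigned done = 0; done < total; done += batch) {
                for (unsigned i = 0; i < batch && done + i < total; i++)
                        prep_fetch_cmd(io_uring_get_sqe(ring), done + i);
                io_uring_submit(ring);
        }
}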
* Re: Large CQE for fuse headers 2024-10-16 10:54 ` Miklos Szeredi @ 2024-10-16 11:53 ` Bernd Schubert 2024-10-16 12:24 ` Miklos Szeredi 2024-10-17 0:59 ` Ming Lei 0 siblings, 2 replies; 23+ messages in thread From: Bernd Schubert @ 2024-10-16 11:53 UTC (permalink / raw) To: Miklos Szeredi Cc: Pavel Begunkov, Ming Lei, Jens Axboe, io-uring, Joanne Koong, Josef Bacik On 10/16/24 12:54, Miklos Szeredi wrote: > On Mon, 14 Oct 2024 at 23:27, Bernd Schubert <[email protected]> wrote: > >> With only libfuse as ring user it is more like >> >> prep_requests(nr=N); >> wait_cq(1); ==> we must not wait for more than 1 as more might never arrive >> io_uring_for_each_cqe { >> } > > Right. > > I think the point Pavel is trying to make is that io_uring queue > sizes don't have to match fuse queue size. So we could have > sq_entries=4, cq_entries=4 and have the server queue 64 > FUSE_URING_REQ_FETCH commands, it just has to do that in batches of 4 > max. Hmm ok, I guess that might matter when payload is small compared to SQ/CQ size and the system is low in memory. > >> @Miklos maybe we avoid using large CQEs/SQEs and instead set up our own >> separate buffer for FUSE headers? > > The only gain from this would be in the case where the uring is used > for non-fuse requests as well, in which case the extra space in the > queue entries would be unused (i.e. 48 unused bytes in the cacheline). > I don't know if this is a realistic use case or not. It's definitely > a challenge to create a library API that allows this. > > The disadvantage would be a more complex interface. I don't think that's complicated. In the end it is just another pointer that needs to be mapped. We don't even need to use mmap. At least for zero-copy we will need to put non-fuse requests on the ring. For the DDN use case, we are using another io-uring for tcp requests, I would actually like to switch that to the same ring. Thanks, Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-16 11:53 ` Bernd Schubert @ 2024-10-16 12:24 ` Miklos Szeredi 2024-10-17 0:59 ` Ming Lei 1 sibling, 0 replies; 23+ messages in thread From: Miklos Szeredi @ 2024-10-16 12:24 UTC (permalink / raw) To: Bernd Schubert Cc: Pavel Begunkov, Ming Lei, Jens Axboe, io-uring, Joanne Koong, Josef Bacik On Wed, 16 Oct 2024 at 13:53, Bernd Schubert <[email protected]> wrote: > I don't think that complicated. In the end it is just another pointer > that needs to be mapped. We don't even need to use mmap. > At least for zero-copy we will need to the ring non-fuse requests. > For the DDN use case, we are using another io-uring for tcp requests, > I would actually like to switch that to the same ring. Okay, let's try and see how that works. Thanks, Miklos ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-16 11:53 ` Bernd Schubert 2024-10-16 12:24 ` Miklos Szeredi @ 2024-10-17 0:59 ` Ming Lei 1 sibling, 0 replies; 23+ messages in thread From: Ming Lei @ 2024-10-17 0:59 UTC (permalink / raw) To: Bernd Schubert Cc: Miklos Szeredi, Pavel Begunkov, Jens Axboe, io-uring, Joanne Koong, Josef Bacik On Wed, Oct 16, 2024 at 01:53:00PM +0200, Bernd Schubert wrote: > > > On 10/16/24 12:54, Miklos Szeredi wrote: > > On Mon, 14 Oct 2024 at 23:27, Bernd Schubert <[email protected]> wrote: > > > >> With only libfuse as ring user it is more like > >> > >> prep_requests(nr=N); > >> wait_cq(1); ==> we must not wait for more than 1 as more might never arrive > >> io_uring_for_each_cqe { > >> } > > > > Right. > > > > I think the point Pavel is trying to make is that io_uring queue > > sizes don't have to match fuse queue size. So we could have > > sq_entries=4, cq_entries=4 and have the server queue 64 > > FUSE_URING_REQ_FETCH commands, it just has to do that in batches of 4 > > max. > > Hmm ok, I guess that might matter when payload is small compared to > SQ/CQ size and the system is low in memory. > > > > >> @Miklos maybe we avoid using large CQEs/SQEs and instead set up our own > >> separate buffer for FUSE headers? > > > > The only gain from this would be in the case where the uring is used > > for non-fuse requests as well, in which case the extra space in the > > queue entries would be unused (i.e. 48 unused bytes in the cacheline). > > I don't know if this is a realistic use case or not. It's definitely > > a challenge to create a library API that allows this. > > > > The disadvantage would be a more complex interface. > > I don't think that complicated. In the end it is just another pointer > that needs to be mapped. We don't even need to use mmap. > At least for zero-copy we will need to the ring non-fuse requests. > For the DDN use case, we are using another io-uring for tcp requests, > I would actually like to switch that to the same ring. I remember the biggest trouble of using same ring in ublk could be exporting the ring for API users, but it is often per-task, seems not too hard to deal with. The pros is you needn't use eventfd to communicate with fuse command uring(thread) any more, and more uring IOs can be handled in single batch. Performance is better, with less task switch involved, without extra communication. Thanks, Ming ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-14 11:10 ` Miklos Szeredi 2024-10-14 12:47 ` Bernd Schubert @ 2024-10-14 13:20 ` Bernd Schubert 1 sibling, 0 replies; 23+ messages in thread From: Bernd Schubert @ 2024-10-14 13:20 UTC (permalink / raw) To: Miklos Szeredi, Ming Lei Cc: Jens Axboe, io-uring, Pavel Begunkov, Joanne Koong, Josef Bacik, Antonio SJ Musumeci On 10/14/24 13:10, Miklos Szeredi wrote: > On Mon, 14 Oct 2024 at 04:44, Ming Lei <[email protected]> wrote: > >> It also depends on how fuse user code consumes the big CQE payload, if >> fuse header needs to keep in memory a bit long, you may have to copy it >> somewhere for post-processing since io_uring(kernel) needs CQE to be >> returned back asap. > > Yes. > > I'm not quite sure how the libfuse interface will work to accommodate > this. Currently if the server needs to delay the processing of a > request it would have to copy all arguments, since validity will not > be guaranteed after the callback returns. With the io_uring Well, it depends on the libfuse implementation. In plain libfuse the buffer is associated with the thread. This could be improved by creating a request pool and buffers per request. AFAIK, Antonio has done that for mergerfs. > infrastructure the headers would need to be copied, but the data > buffer would be per-request and would not need copying. This is > relaxing a requirement so existing servers would continue to work Yep, that is actually how we use it at ddn for requests over io-uring. > fine, but would not be able to take full advantage of the multi-buffer > design. What do you actually mean by "multi-buffer design"? Thanks, Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-13 21:20 ` Bernd Schubert 2024-10-14 2:44 ` Ming Lei @ 2024-10-14 10:31 ` Miklos Szeredi 1 sibling, 0 replies; 23+ messages in thread From: Miklos Szeredi @ 2024-10-14 10:31 UTC (permalink / raw) To: Bernd Schubert Cc: Jens Axboe, Ming Lei, io-uring, Pavel Begunkov, Joanne Koong, Josef Bacik On Sun, 13 Oct 2024 at 23:20, Bernd Schubert <[email protected]> wrote: > > > > On 10/12/24 16:38, Jens Axboe wrote: > > That may indeed be a decent idea for this too. You don't even need fancy > > tagging, you can just use the cqe index for your tag too, as it should > > not be bigger than the the cq ring space. Then you can get away with > > just using normal cqe sizes, and just have a shared region between the > > two where data gets written by the uring_cmd completion, and the app can > > access it directly from userspace. > > Would be good if Miklos could chime in here, adding back mmap for headers > wouldn't be difficult, but would add back more fuse-uring startup and > tear-down code. My worry is making the API more complex, OTOH I understand the need for io_uring to refrain from adding fuse specific features. Also seems like io_uring is accounting some of the pinned memory, but for the queues themselves it does not do that, even though the max number of sqes (32k) can take a substantial amount of memory. Growing the cqe would make this worse, but this could be fixed by adding the missing accounting, possibly only if using non-standard cqe sizes to avoid breaking backward compatibility. Thanks, Miklos ^ permalink raw reply [flat|nested] 23+ messages in thread