* Large CQE for fuse headers @ 2024-10-10 20:56 Bernd Schubert 2024-10-11 17:57 ` Jens Axboe ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Bernd Schubert @ 2024-10-10 20:56 UTC (permalink / raw) To: io-uring Cc: Pavel Begunkov, Jens Axboe, Miklos Szeredi, Joanne Koong, Josef Bacik Hello, as discussed during LPC, we would like to have large CQE sizes, at least 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... Pavel said that this should be ok, but it would be better to have the CQE size as function argument. Could you give me some hints how this should look like and especially how we are going to communicate the CQE size to the kernel? I guess just adding IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. I'm basically through with other changes Miklos had been asking for and moving fuse headers into the CQE is next. Thanks, Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-10 20:56 Large CQE for fuse headers Bernd Schubert @ 2024-10-11 17:57 ` Jens Axboe 2024-10-11 18:35 ` Bernd Schubert 2024-10-11 21:38 ` Pavel Begunkov 2024-10-12 1:55 ` Ming Lei 2 siblings, 1 reply; 23+ messages in thread From: Jens Axboe @ 2024-10-11 17:57 UTC (permalink / raw) To: Bernd Schubert, io-uring Cc: Pavel Begunkov, Miklos Szeredi, Joanne Koong, Josef Bacik On 10/10/24 2:56 PM, Bernd Schubert wrote: > Hello, > > as discussed during LPC, we would like to have large CQE sizes, at least > 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... > > Pavel said that this should be ok, but it would be better to have the CQE > size as function argument. > Could you give me some hints how this should look like and especially how > we are going to communicate the CQE size to the kernel? I guess just adding > IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. Not Pavel and unfortunately I could not be at that LPC discussion, but yeah I don't see why just adding the necessary SETUP arg for this wouldn't be the way to go. As long as they are power-of-2, then all it'll impact on both the kernel and liburing side is what size shift to use when iterating CQEs. Since this obviously means larger CQ rings, one nice side effect is that since 6.10 we don't need contig pages to map any of the rings. So should work just fine regardless of memory fragmentation, where previously that would've been a concern. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
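For context, the existing big-CQE support is already just a setup flag plus a size shift, so a hypothetical IORING_SETUP_CQE256 would follow the same pattern. A minimal liburing sketch of today's CQE32 usage (the 256B flag itself does not exist yet):

#include <liburing.h>

/* minimal sketch: a ring whose CQEs are 32 bytes instead of 16; with
 * IORING_SETUP_CQE32 the 16 extra bytes of each completion are exposed as
 * cqe->big_cqe[0..1], and a bigger CQE size would extend that trailing area */
static int setup_cqe32_ring(struct io_uring *ring)
{
        struct io_uring_params p = { .flags = IORING_SETUP_CQE32 };

        return io_uring_queue_init_params(64, ring, &p);
}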
* Re: Large CQE for fuse headers 2024-10-11 17:57 ` Jens Axboe @ 2024-10-11 18:35 ` Bernd Schubert 2024-10-11 18:39 ` Jens Axboe 0 siblings, 1 reply; 23+ messages in thread From: Bernd Schubert @ 2024-10-11 18:35 UTC (permalink / raw) To: Jens Axboe, io-uring Cc: Pavel Begunkov, Miklos Szeredi, Joanne Koong, Josef Bacik On 10/11/24 19:57, Jens Axboe wrote: > On 10/10/24 2:56 PM, Bernd Schubert wrote: >> Hello, >> >> as discussed during LPC, we would like to have large CQE sizes, at least >> 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... >> >> Pavel said that this should be ok, but it would be better to have the CQE >> size as function argument. >> Could you give me some hints how this should look like and especially how >> we are going to communicate the CQE size to the kernel? I guess just adding >> IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. > > Not Pavel and unfortunately I could not be at that LPC discussion, but > yeah I don't see why not just adding the necessary SETUP arg for this > would not be the way to go. As long as they are power-of-2, then all > it'll impact on both the kernel and liburing side is what size shift to > use when iterating CQEs. Thanks, Pavel also wanted power-of-2, although 512 is a bit much for fuse. Well, maybe 256 will be sufficient. Going to look into adding that parameter during the next days. > > Since this obviously means larger CQ rings, one nice side effect is that > since 6.10 we don't need contig pages to map any of the rings. So should > work just fine regardless of memory fragmentation, where previously that > would've been a concern. > Out of interest, what is the change? Up to fuse-io-uring rfc2 I was vmalloced buffers for fuse that got mmaped - was working fine. Miklos just wants to avoid that kernel allocates large chunks of memory on behalf of users. Thanks, Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-11 18:35 ` Bernd Schubert @ 2024-10-11 18:39 ` Jens Axboe 2024-10-11 19:03 ` Bernd Schubert 0 siblings, 1 reply; 23+ messages in thread From: Jens Axboe @ 2024-10-11 18:39 UTC (permalink / raw) To: Bernd Schubert, io-uring Cc: Pavel Begunkov, Miklos Szeredi, Joanne Koong, Josef Bacik On 10/11/24 12:35 PM, Bernd Schubert wrote: > On 10/11/24 19:57, Jens Axboe wrote: >> On 10/10/24 2:56 PM, Bernd Schubert wrote: >>> Hello, >>> >>> as discussed during LPC, we would like to have large CQE sizes, at least >>> 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... >>> >>> Pavel said that this should be ok, but it would be better to have the CQE >>> size as function argument. >>> Could you give me some hints how this should look like and especially how >>> we are going to communicate the CQE size to the kernel? I guess just adding >>> IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. >> >> Not Pavel and unfortunately I could not be at that LPC discussion, but >> yeah I don't see why not just adding the necessary SETUP arg for this >> would not be the way to go. As long as they are power-of-2, then all >> it'll impact on both the kernel and liburing side is what size shift to >> use when iterating CQEs. > > Thanks, Pavel also wanted power-of-2, although 512 is a bit much for fuse. > Well, maybe 256 will be sufficient. Going to look into adding that parameter > during the next days. We really have to keep it pow-of-2 just to avoid convoluting the logic (and overhead) of iterating the CQ ring and CQEs. You can search for IORING_SETUP_CQE32 in the kernel to see how it's just a shift, and ditto on the liburing side. Curious, what's all the space needed for? >> Since this obviously means larger CQ rings, one nice side effect is that >> since 6.10 we don't need contig pages to map any of the rings. So should >> work just fine regardless of memory fragmentation, where previously that >> would've been a concern. >> > > Out of interest, what is the change? Up to fuse-io-uring rfc2 I was > vmalloced buffers for fuse that got mmaped - was working fine. Miklos just > wants to avoid that kernel allocates large chunks of memory on behalf of > users. It was the change that got rid of remap_pfn_range() for mapping, and switched to vm_insert_page(s) instead. Memory overhead should generally not be too bad, it's all about sizing the rings appropriately. The much bigger concern is needing contig memory, as that can become scarce after longer uptimes, even with plenty of memory free. This is particularly important if you need 512b CQEs, obviously. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
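To illustrate the "just a shift" point, a sketch of how a power-of-2 CQE size only changes how the CQ ring is indexed; the shift values for sizes beyond CQE32 are an assumption here:

/* with 16B CQEs the shift is 0, with CQE32 it is 1, and a hypothetical
 * 256B CQE would use 4 -- the iteration logic is otherwise unchanged */
static inline struct io_uring_cqe *cqe_at(struct io_uring_cqe *cqes,
                                          unsigned idx, unsigned cqe_shift)
{
        return &cqes[idx << cqe_shift];
}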
* Re: Large CQE for fuse headers 2024-10-11 18:39 ` Jens Axboe @ 2024-10-11 19:03 ` Bernd Schubert 2024-10-11 19:24 ` Jens Axboe 0 siblings, 1 reply; 23+ messages in thread From: Bernd Schubert @ 2024-10-11 19:03 UTC (permalink / raw) To: Jens Axboe, io-uring Cc: Pavel Begunkov, Miklos Szeredi, Joanne Koong, Josef Bacik On 10/11/24 20:39, Jens Axboe wrote: > On 10/11/24 12:35 PM, Bernd Schubert wrote: >> On 10/11/24 19:57, Jens Axboe wrote: >>> On 10/10/24 2:56 PM, Bernd Schubert wrote: >>>> Hello, >>>> >>>> as discussed during LPC, we would like to have large CQE sizes, at least >>>> 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... >>>> >>>> Pavel said that this should be ok, but it would be better to have the CQE >>>> size as function argument. >>>> Could you give me some hints how this should look like and especially how >>>> we are going to communicate the CQE size to the kernel? I guess just adding >>>> IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. >>> >>> Not Pavel and unfortunately I could not be at that LPC discussion, but >>> yeah I don't see why not just adding the necessary SETUP arg for this >>> would not be the way to go. As long as they are power-of-2, then all >>> it'll impact on both the kernel and liburing side is what size shift to >>> use when iterating CQEs. >> >> Thanks, Pavel also wanted power-of-2, although 512 is a bit much for fuse. >> Well, maybe 256 will be sufficient. Going to look into adding that parameter >> during the next days. > > We really have to keep it pow-of-2 just to avoid convoluting the logic > (and overhead) of iterating the CQ ring and CQEs. You can search for > IORING_SETUP_CQE32 in the kernel to see how it's just a shift, and ditto > on the liburing side. Thanks, going to look into it. > > Curious, what's all the space needed for? The basic fuse header: struct fuse_in_header -> current 40B and per request header headers, I think current max is 64. And then some extra compat space for both, so that they can be safely extended in the future (which is currently an issue). > >>> Since this obviously means larger CQ rings, one nice side effect is that >>> since 6.10 we don't need contig pages to map any of the rings. So should >>> work just fine regardless of memory fragmentation, where previously that >>> would've been a concern. >>> >> >> Out of interest, what is the change? Up to fuse-io-uring rfc2 I was >> vmalloced buffers for fuse that got mmaped - was working fine. Miklos just >> wants to avoid that kernel allocates large chunks of memory on behalf of >> users. > > It was the change that got rid of remap_pfn_range() for mapping, and > switched to vm_insert_page(s) instead. Memory overhead should generally > not be too bad, it's all about sizing the rings appropriately. The much > bigger concern is needing contig memory, as that can become scarce after > longer uptimes, even with plenty of memory free. This is particularly > important if you need 512b CQEs, obviously. > For sure, I was just curious what you had changed. I think I had looked into that io-uring code around 2 years ago. Going to look into the update io-uring code, thanks for the hint. For fuse I was just using remap_vmalloc_range(). https://lore.kernel.org/all/[email protected]/ Thanks, Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
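A rough sketch of the space estimate above; this is not a proposed uapi, and the struct name, union members and reserved size are only illustrative:

#include <linux/fuse.h>

struct fuse_uring_cqe_payload {         /* hypothetical */
        struct fuse_in_header in;       /* currently 40 bytes */
        union {                         /* per-opcode header, current max ~64 bytes */
                struct fuse_read_in read;
                struct fuse_write_in write;
                char raw[64];
        } op;
        char reserved[24];              /* extra compat space so both parts can
                                         * grow; this is what pushes the total
                                         * toward a 256B CQE */
};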
* Re: Large CQE for fuse headers 2024-10-11 19:03 ` Bernd Schubert @ 2024-10-11 19:24 ` Jens Axboe 0 siblings, 0 replies; 23+ messages in thread From: Jens Axboe @ 2024-10-11 19:24 UTC (permalink / raw) To: Bernd Schubert, io-uring Cc: Pavel Begunkov, Miklos Szeredi, Joanne Koong, Josef Bacik On 10/11/24 1:03 PM, Bernd Schubert wrote: >> >> Curious, what's all the space needed for? > > The basic fuse header: struct fuse_in_header -> current 40B > and per request header headers, I think current max is 64. > > And then some extra compat space for both, so that they can be safely > extended in the future (which is currently an issue). So that's 104b, and regular CQE stuff too I presume, so that's 104+16 == 120 bytes. That'd fit in a 128b CQE, and 256b would be pleeeeenty? Just squeeze a version field or something in there so you know what the version is for future proofing? I would strongly recommend making it as large as you need it for those things, but no longer just for compat/future reasons. Eg 128b over 256b is a win for sure, and 256b over 512b is a REALLY nice win. >>>> Since this obviously means larger CQ rings, one nice side effect is that >>>> since 6.10 we don't need contig pages to map any of the rings. So should >>>> work just fine regardless of memory fragmentation, where previously that >>>> would've been a concern. >>>> >>> >>> Out of interest, what is the change? Up to fuse-io-uring rfc2 I was >>> vmalloced buffers for fuse that got mmaped - was working fine. Miklos just >>> wants to avoid that kernel allocates large chunks of memory on behalf of >>> users. >> >> It was the change that got rid of remap_pfn_range() for mapping, and >> switched to vm_insert_page(s) instead. Memory overhead should generally >> not be too bad, it's all about sizing the rings appropriately. The much >> bigger concern is needing contig memory, as that can become scarce after >> longer uptimes, even with plenty of memory free. This is particularly >> important if you need 512b CQEs, obviously. >> > > For sure, I was just curious what you had changed. I think I had looked into > that io-uring code around 2 years ago. Going to look into the update > io-uring code, thanks for the hint. > For fuse I was just using remap_vmalloc_range(). > > https://lore.kernel.org/all/[email protected]/ That's the one to use, io_uring was just stuck with using the wrong API for quite a while, but that got sorted. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
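One possible shape for the suggested version field, purely as a sketch (the struct and field names are made up):

#include <stdint.h>

struct fuse_uring_cqe_hdr {             /* hypothetical */
        uint16_t version;               /* bumped when the layout grows */
        uint16_t payload_len;           /* bytes the kernel actually filled in */
        /* struct fuse_in_header plus the per-opcode header would follow */
};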
* Re: Large CQE for fuse headers 2024-10-10 20:56 Large CQE for fuse headers Bernd Schubert 2024-10-11 17:57 ` Jens Axboe @ 2024-10-11 21:38 ` Pavel Begunkov 2024-10-12 1:55 ` Ming Lei 2 siblings, 0 replies; 23+ messages in thread From: Pavel Begunkov @ 2024-10-11 21:38 UTC (permalink / raw) To: Bernd Schubert, io-uring Cc: Jens Axboe, Miklos Szeredi, Joanne Koong, Josef Bacik On 10/10/24 21:56, Bernd Schubert wrote: > Hello, > > as discussed during LPC, we would like to have large CQE sizes, at least > 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... > > Pavel said that this should be ok, but it would be better to have the CQE > size as function argument. > Could you give me some hints how this should look like and especially how I remembered it as SQEs, which would've been much easier. In the current io_uring infra, usually to post a cqe an opcode specific path stashed the result value in the request structure and then the core io_uring code will post it for you. We won't find space for 256B, however, and it'd need to happen right from the cmd path and follow the rules when / from what context it can be posted. I'll take a stub to see how it can look like. > we are going to communicate the CQE size to the kernel? I guess just adding > IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. 3-4 special cases is already odd as an API, we should rather just pass the desired CQE size. > I'm basically through with other changes Miklos had been asking for and > moving fuse headers into the CQE is next. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 23+ messages in thread
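For reference, a uring_cmd provider today completes through the stashed-result path described above; with CQE32 the extra value ends up in the big CQE, and a 256B payload would need a wider variant of this hook. A sketch of the existing kernel-side call only (the wrapper function is hypothetical):

#include <linux/io_uring/cmd.h>

/* sketch: 'err' becomes cqe->res, and with IORING_SETUP_CQE32 'res2' is
 * written into the extra 16 bytes of the CQE */
static void fuse_uring_cmd_complete(struct io_uring_cmd *ioucmd, int err,
                                    u64 res2, unsigned int issue_flags)
{
        io_uring_cmd_done(ioucmd, err, res2, issue_flags);
}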
* Re: Large CQE for fuse headers 2024-10-10 20:56 Large CQE for fuse headers Bernd Schubert 2024-10-11 17:57 ` Jens Axboe 2024-10-11 21:38 ` Pavel Begunkov @ 2024-10-12 1:55 ` Ming Lei 2024-10-12 14:38 ` Jens Axboe 2 siblings, 1 reply; 23+ messages in thread From: Ming Lei @ 2024-10-12 1:55 UTC (permalink / raw) To: Bernd Schubert Cc: io-uring, Pavel Begunkov, Jens Axboe, Miklos Szeredi, Joanne Koong, Josef Bacik On Fri, Oct 11, 2024 at 4:56 AM Bernd Schubert <[email protected]> wrote: > > Hello, > > as discussed during LPC, we would like to have large CQE sizes, at least > 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... > > Pavel said that this should be ok, but it would be better to have the CQE > size as function argument. > Could you give me some hints how this should look like and especially how > we are going to communicate the CQE size to the kernel? I guess just adding > IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. > > I'm basically through with other changes Miklos had been asking for and > moving fuse headers into the CQE is next. Big CQE may not be efficient, there are copy from kernel to CQE and from CQE to userspace. And not flexible, it is one ring-wide property, if it is big, any CQE from this ring has to be big. If you are saying uring_cmd, another way is to mapped one area for this purpose, the fuse driver can write fuse headers to this indexed mmap buffer, and userspace read it, which is just efficient, without io_uring core changes. ublk uses this way to fill IO request header. But it requires each command to have a unique tag. thanks, Ming Lei ^ permalink raw reply [flat|nested] 23+ messages in thread
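A sketch of the ublk-style alternative being described; the slot size, mmap offset and device fd are all assumptions for illustration:

#include <stddef.h>
#include <sys/mman.h>
#include <linux/fuse.h>

#define HDR_SLOT_SIZE   128             /* hypothetical per-request header slot */
#define HDR_AREA_OFF    0x10000000UL    /* hypothetical mmap offset of the area */

/* userspace maps one header slot per possible tag; the kernel writes the
 * fuse headers for request 'tag' into its slot instead of into a big CQE */
static void *map_hdr_area(int dev_fd, unsigned nr_slots)
{
        return mmap(NULL, (size_t)nr_slots * HDR_SLOT_SIZE, PROT_READ,
                    MAP_SHARED, dev_fd, HDR_AREA_OFF);
}

static struct fuse_in_header *hdr_for_tag(void *hdr_area, unsigned tag)
{
        return (struct fuse_in_header *)((char *)hdr_area + tag * HDR_SLOT_SIZE);
}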
* Re: Large CQE for fuse headers 2024-10-12 1:55 ` Ming Lei @ 2024-10-12 14:38 ` Jens Axboe 2024-10-13 21:20 ` Bernd Schubert 0 siblings, 1 reply; 23+ messages in thread From: Jens Axboe @ 2024-10-12 14:38 UTC (permalink / raw) To: Ming Lei, Bernd Schubert Cc: io-uring, Pavel Begunkov, Miklos Szeredi, Joanne Koong, Josef Bacik On 10/11/24 7:55 PM, Ming Lei wrote: > On Fri, Oct 11, 2024 at 4:56?AM Bernd Schubert > <[email protected]> wrote: >> >> Hello, >> >> as discussed during LPC, we would like to have large CQE sizes, at least >> 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... >> >> Pavel said that this should be ok, but it would be better to have the CQE >> size as function argument. >> Could you give me some hints how this should look like and especially how >> we are going to communicate the CQE size to the kernel? I guess just adding >> IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. >> >> I'm basically through with other changes Miklos had been asking for and >> moving fuse headers into the CQE is next. > > Big CQE may not be efficient, there are copy from kernel to CQE and > from CQE to userspace. And not flexible, it is one ring-wide property, > if it is big, > any CQE from this ring has to be big. There isn't really a copy - the kernel fills it in, generally the application itself, just in the kernel, and then the application can read it on that side. It's the same memory, and it'll also generally be cache hot when the applicatio reaps it. Unless a lot of time has passed, obviously. That said, yeah bigger sqe/cqe is less ideal than smaller ones, obviously. Currently you can fit 4 normal cqes in a cache line, or a single sqe. Making either of them bigger will obviously bloat that. > If you are saying uring_cmd, another way is to mapped one area for > this purpose, the fuse driver can write fuse headers to this indexed > mmap buffer, and userspace read it, which is just efficient, without > io_uring core changes. ublk uses this way to fill IO request header. > But it requires each command to have a unique tag. That may indeed be a decent idea for this too. You don't even need fancy tagging, you can just use the cqe index for your tag too, as it should not be bigger than the the cq ring space. Then you can get away with just using normal cqe sizes, and just have a shared region between the two where data gets written by the uring_cmd completion, and the app can access it directly from userspace. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
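The consuming side of that idea could then stay on normal-sized CQEs, e.g. as in this sketch, reusing the hypothetical hdr_for_tag() slot helper sketched above, with the tag carried in user_data; handle_fuse_request() is a stand-in for the fuse dispatch:

#include <liburing.h>

static void reap_fuse_cqes(struct io_uring *ring, void *hdr_area)
{
        struct io_uring_cqe *cqe;
        unsigned head, seen = 0;

        io_uring_for_each_cqe(ring, head, cqe) {
                unsigned tag = (unsigned)cqe->user_data;  /* tag == slot index */

                handle_fuse_request(hdr_for_tag(hdr_area, tag), cqe->res);
                seen++;
        }
        io_uring_cq_advance(ring, seen);
}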
* Re: Large CQE for fuse headers 2024-10-12 14:38 ` Jens Axboe @ 2024-10-13 21:20 ` Bernd Schubert 2024-10-14 2:44 ` Ming Lei 2024-10-14 10:31 ` Miklos Szeredi 0 siblings, 2 replies; 23+ messages in thread From: Bernd Schubert @ 2024-10-13 21:20 UTC (permalink / raw) To: Jens Axboe, Ming Lei Cc: io-uring, Pavel Begunkov, Miklos Szeredi, Joanne Koong, Josef Bacik On 10/12/24 16:38, Jens Axboe wrote: > On 10/11/24 7:55 PM, Ming Lei wrote: >> On Fri, Oct 11, 2024 at 4:56?AM Bernd Schubert >> <[email protected]> wrote: >>> >>> Hello, >>> >>> as discussed during LPC, we would like to have large CQE sizes, at least >>> 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... >>> >>> Pavel said that this should be ok, but it would be better to have the CQE >>> size as function argument. >>> Could you give me some hints how this should look like and especially how >>> we are going to communicate the CQE size to the kernel? I guess just adding >>> IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. >>> >>> I'm basically through with other changes Miklos had been asking for and >>> moving fuse headers into the CQE is next. >> >> Big CQE may not be efficient, there are copy from kernel to CQE and >> from CQE to userspace. And not flexible, it is one ring-wide property, >> if it is big, >> any CQE from this ring has to be big. > > There isn't really a copy - the kernel fills it in, generally the > application itself, just in the kernel, and then the application can > read it on that side. It's the same memory, and it'll also generally be > cache hot when the applicatio reaps it. Unless a lot of time has passed, > obviously. > > That said, yeah bigger sqe/cqe is less ideal than smaller ones, > obviously. Currently you can fit 4 normal cqes in a cache line, or a > single sqe. Making either of them bigger will obviously bloat that. > >> If you are saying uring_cmd, another way is to mapped one area for >> this purpose, the fuse driver can write fuse headers to this indexed >> mmap buffer, and userspace read it, which is just efficient, without >> io_uring core changes. ublk uses this way to fill IO request header. >> But it requires each command to have a unique tag. > > That may indeed be a decent idea for this too. You don't even need fancy > tagging, you can just use the cqe index for your tag too, as it should > not be bigger than the the cq ring space. Then you can get away with > just using normal cqe sizes, and just have a shared region between the > two where data gets written by the uring_cmd completion, and the app can > access it directly from userspace. Would be good if Miklos could chime in here, adding back mmap for headers wouldn't be difficult, but would add back more fuse-uring startup and tear-down code. From performance point of view, I don't know anything about CPU cache prefetching, but shouldn't the cpu cache logic be able to easily prefetch larger linear io-uring rings into 2nd/3rd level caches? And if if the fuse header is in a separated buffer, it can't auto prefetch that without additional instructions? I.e. how would the cpu cache logic auto know about these additional memory areas? Thanks, Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-13 21:20 ` Bernd Schubert @ 2024-10-14 2:44 ` Ming Lei 2024-10-14 11:10 ` Miklos Szeredi 2024-10-14 10:31 ` Miklos Szeredi 1 sibling, 1 reply; 23+ messages in thread From: Ming Lei @ 2024-10-14 2:44 UTC (permalink / raw) To: Bernd Schubert Cc: Jens Axboe, io-uring, Pavel Begunkov, Miklos Szeredi, Joanne Koong, Josef Bacik On Sun, Oct 13, 2024 at 11:20:53PM +0200, Bernd Schubert wrote: > > > On 10/12/24 16:38, Jens Axboe wrote: > > On 10/11/24 7:55 PM, Ming Lei wrote: > >> On Fri, Oct 11, 2024 at 4:56?AM Bernd Schubert > >> <[email protected]> wrote: > >>> > >>> Hello, > >>> > >>> as discussed during LPC, we would like to have large CQE sizes, at least > >>> 256B. Ideally 256B for fuse, but CQE512 might be a bit too much... > >>> > >>> Pavel said that this should be ok, but it would be better to have the CQE > >>> size as function argument. > >>> Could you give me some hints how this should look like and especially how > >>> we are going to communicate the CQE size to the kernel? I guess just adding > >>> IORING_SETUP_CQE256 / IORING_SETUP_CQE512 would be much easier. > >>> > >>> I'm basically through with other changes Miklos had been asking for and > >>> moving fuse headers into the CQE is next. > >> > >> Big CQE may not be efficient, there are copy from kernel to CQE and > >> from CQE to userspace. And not flexible, it is one ring-wide property, > >> if it is big, > >> any CQE from this ring has to be big. > > > > There isn't really a copy - the kernel fills it in, generally the > > application itself, just in the kernel, and then the application can > > read it on that side. It's the same memory, and it'll also generally be > > cache hot when the applicatio reaps it. Unless a lot of time has passed, > > obviously. > > > > That said, yeah bigger sqe/cqe is less ideal than smaller ones, > > obviously. Currently you can fit 4 normal cqes in a cache line, or a > > single sqe. Making either of them bigger will obviously bloat that. > > > >> If you are saying uring_cmd, another way is to mapped one area for > >> this purpose, the fuse driver can write fuse headers to this indexed > >> mmap buffer, and userspace read it, which is just efficient, without > >> io_uring core changes. ublk uses this way to fill IO request header. > >> But it requires each command to have a unique tag. > > > > That may indeed be a decent idea for this too. You don't even need fancy > > tagging, you can just use the cqe index for your tag too, as it should > > not be bigger than the the cq ring space. Then you can get away with > > just using normal cqe sizes, and just have a shared region between the > > two where data gets written by the uring_cmd completion, and the app can > > access it directly from userspace. > > Would be good if Miklos could chime in here, adding back mmap for headers > wouldn't be difficult, but would add back more fuse-uring startup and > tear-down code. > > From performance point of view, I don't know anything about CPU cache > prefetching, but shouldn't the cpu cache logic be able to easily prefetch > larger linear io-uring rings into 2nd/3rd level caches? And if if the > fuse header is in a separated buffer, it can't auto prefetch that > without additional instructions? I.e. how would the cpu cache logic > auto know about these additional memory areas? 
It also depends on how fuse user code consumes the big CQE payload, if fuse header needs to keep in memory a bit long, you may have to copy it somewhere for post-processing since io_uring(kernel) needs CQE to be returned back asap. Thanks, Ming ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-14 2:44 ` Ming Lei @ 2024-10-14 11:10 ` Miklos Szeredi 2024-10-14 12:47 ` Bernd Schubert 2024-10-14 13:20 ` Bernd Schubert 0 siblings, 2 replies; 23+ messages in thread From: Miklos Szeredi @ 2024-10-14 11:10 UTC (permalink / raw) To: Ming Lei Cc: Bernd Schubert, Jens Axboe, io-uring, Pavel Begunkov, Joanne Koong, Josef Bacik On Mon, 14 Oct 2024 at 04:44, Ming Lei <[email protected]> wrote: > It also depends on how fuse user code consumes the big CQE payload, if > fuse header needs to keep in memory a bit long, you may have to copy it > somewhere for post-processing since io_uring(kernel) needs CQE to be > returned back asap. Yes. I'm not quite sure how the libfuse interface will work to accommodate this. Currently if the server needs to delay the processing of a request it would have to copy all arguments, since validity will not be guaranteed after the callback returns. With the io_uring infrastructure the headers would need to be copied, but the data buffer would be per-request and would not need copying. This is relaxing a requirement so existing servers would continue to work fine, but would not be able to take full advantage of the multi-buffer design. Bernd do you have an idea how this would work? Thanks, Miklos ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-14 11:10 ` Miklos Szeredi @ 2024-10-14 12:47 ` Bernd Schubert 2024-10-14 13:34 ` Pavel Begunkov 2024-10-14 13:20 ` Bernd Schubert 1 sibling, 1 reply; 23+ messages in thread From: Bernd Schubert @ 2024-10-14 12:47 UTC (permalink / raw) To: Miklos Szeredi, Ming Lei Cc: Jens Axboe, io-uring, Pavel Begunkov, Joanne Koong, Josef Bacik On 10/14/24 13:10, Miklos Szeredi wrote: > On Mon, 14 Oct 2024 at 04:44, Ming Lei <[email protected]> wrote: > >> It also depends on how fuse user code consumes the big CQE payload, if >> fuse header needs to keep in memory a bit long, you may have to copy it >> somewhere for post-processing since io_uring(kernel) needs CQE to be >> returned back asap. > > Yes. > > I'm not quite sure how the libfuse interface will work to accommodate > this. Currently if the server needs to delay the processing of a > request it would have to copy all arguments, since validity will not > be guaranteed after the callback returns. With the io_uring > infrastructure the headers would need to be copied, but the data > buffer would be per-request and would not need copying. This is > relaxing a requirement so existing servers would continue to work > fine, but would not be able to take full advantage of the multi-buffer > design. > > Bernd do you have an idea how this would work? I assume returning a CQE is io_uring_cq_advance()? In my current libfuse io_uring branch that only happens when all CQEs have been processed. We could also easily switch to io_uring_cqe_seen() to do it per CQE. I don't understand why we need to return CQEs asap, assuming CQ ring size is the same as SQ ring size - why does it matter? If we indeed need to return the CQE before processing the request, it indeed would be better to have a 2nd memory buffer associated with the fuse request. Thanks, Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
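For reference, the two liburing idioms being compared, batch advance versus returning each CQE as soon as it has been handled (process() stands in for the fuse dispatch):

static void reap_batched(struct io_uring *ring)
{
        struct io_uring_cqe *cqe;
        unsigned head, seen = 0;

        io_uring_for_each_cqe(ring, head, cqe) {
                process(cqe);
                seen++;
        }
        io_uring_cq_advance(ring, seen);        /* hand everything back at once */
}

static void reap_one(struct io_uring *ring)
{
        struct io_uring_cqe *cqe;

        if (!io_uring_wait_cqe(ring, &cqe)) {
                process(cqe);
                io_uring_cqe_seen(ring, cqe);   /* return this CQE immediately */
        }
}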
* Re: Large CQE for fuse headers 2024-10-14 12:47 ` Bernd Schubert @ 2024-10-14 13:34 ` Pavel Begunkov 2024-10-14 15:21 ` Bernd Schubert 0 siblings, 1 reply; 23+ messages in thread From: Pavel Begunkov @ 2024-10-14 13:34 UTC (permalink / raw) To: Bernd Schubert, Miklos Szeredi, Ming Lei Cc: Jens Axboe, io-uring, Joanne Koong, Josef Bacik On 10/14/24 13:47, Bernd Schubert wrote: > On 10/14/24 13:10, Miklos Szeredi wrote: >> On Mon, 14 Oct 2024 at 04:44, Ming Lei <[email protected]> wrote: >> >>> It also depends on how fuse user code consumes the big CQE payload, if >>> fuse header needs to keep in memory a bit long, you may have to copy it >>> somewhere for post-processing since io_uring(kernel) needs CQE to be >>> returned back asap. >> >> Yes. >> >> I'm not quite sure how the libfuse interface will work to accommodate >> this. Currently if the server needs to delay the processing of a >> request it would have to copy all arguments, since validity will not >> be guaranteed after the callback returns. With the io_uring >> infrastructure the headers would need to be copied, but the data >> buffer would be per-request and would not need copying. This is >> relaxing a requirement so existing servers would continue to work >> fine, but would not be able to take full advantage of the multi-buffer >> design. >> >> Bernd do you have an idea how this would work? > > I assume returning a CQE is io_uring_cq_advance()? Yes > In my current libfuse io_uring branch that only happens when > all CQEs have been processed. We could also easily switch to > io_uring_cqe_seen() to do it per CQE. Either that one. > I don't understand why we need to return CQEs asap, assuming CQ > ring size is the same as SQ ring size - why does it matter? The SQE is consumed once the request is issued, but nothing prevents the user to keep the QD larger than the SQ size, e.g. do M syscalls each ending N requests and then wait for N * M completions. > If we indeed need to return the CQE before processing the request, > it indeed would be better to have a 2nd memory buffer associated with > the fuse request. With that said, the usual problem is to size the CQ so that it (almost) never overflows, otherwise it hurts performance. With DEFER_TASKRUN you can delay returning CQEs to the kernel until the next time you wait for completions, i.e. do io_uring waiting syscall. Without the flag, CQEs may come asynchronously to the user, so need a bit more consideration. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 23+ messages in thread
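The DEFER_TASKRUN variant mentioned at the end is a ring-creation flag and requires SINGLE_ISSUER; a minimal sketch:

static int setup_ring_defer(struct io_uring *ring, unsigned entries)
{
        struct io_uring_params p = {
                .flags = IORING_SETUP_SINGLE_ISSUER |
                         IORING_SETUP_DEFER_TASKRUN,
        };

        /* completions are then only posted when this task enters the waiting
         * syscall, which makes CQ sizing and overflow easier to reason about */
        return io_uring_queue_init_params(entries, ring, &p);
}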
* Re: Large CQE for fuse headers 2024-10-14 13:34 ` Pavel Begunkov @ 2024-10-14 15:21 ` Bernd Schubert 2024-10-14 17:48 ` Pavel Begunkov 0 siblings, 1 reply; 23+ messages in thread From: Bernd Schubert @ 2024-10-14 15:21 UTC (permalink / raw) To: Pavel Begunkov, Miklos Szeredi, Ming Lei Cc: Jens Axboe, io-uring, Joanne Koong, Josef Bacik On 10/14/24 15:34, Pavel Begunkov wrote: > On 10/14/24 13:47, Bernd Schubert wrote: >> On 10/14/24 13:10, Miklos Szeredi wrote: >>> On Mon, 14 Oct 2024 at 04:44, Ming Lei <[email protected]> wrote: >>> >>>> It also depends on how fuse user code consumes the big CQE payload, if >>>> fuse header needs to keep in memory a bit long, you may have to copy it >>>> somewhere for post-processing since io_uring(kernel) needs CQE to be >>>> returned back asap. >>> >>> Yes. >>> >>> I'm not quite sure how the libfuse interface will work to accommodate >>> this. Currently if the server needs to delay the processing of a >>> request it would have to copy all arguments, since validity will not >>> be guaranteed after the callback returns. With the io_uring >>> infrastructure the headers would need to be copied, but the data >>> buffer would be per-request and would not need copying. This is >>> relaxing a requirement so existing servers would continue to work >>> fine, but would not be able to take full advantage of the multi-buffer >>> design. >>> >>> Bernd do you have an idea how this would work? >> >> I assume returning a CQE is io_uring_cq_advance()? > > Yes > >> In my current libfuse io_uring branch that only happens when >> all CQEs have been processed. We could also easily switch to >> io_uring_cqe_seen() to do it per CQE. > > Either that one. > >> I don't understand why we need to return CQEs asap, assuming CQ >> ring size is the same as SQ ring size - why does it matter? > > The SQE is consumed once the request is issued, but nothing > prevents the user to keep the QD larger than the SQ size, > e.g. do M syscalls each ending N requests and then wait for > N * M completions. > I need a bit help to understand this. Do you mean that in typical io-uring usage SQEs get submitted, already released in kernel and then users submit even more SQEs? And that creates a kernel queue depth for completion? I guess as long as libfuse does not expose the ring we don't have that issue. But then yeah, exposing the ring to fuse-server/daemon is planned... >> If we indeed need to return the CQE before processing the request, >> it indeed would be better to have a 2nd memory buffer associated with >> the fuse request. > > With that said, the usual problem is to size the CQ so that it > (almost) never overflows, otherwise it hurts performance. With > DEFER_TASKRUN you can delay returning CQEs to the kernel until > the next time you wait for completions, i.e. do io_uring waiting > syscall. Without the flag, CQEs may come asynchronously to the > user, so need a bit more consideration. > Current libfuse code has it disabled IORING_SETUP_SINGLE_ISSUER, IORING_SETUP_DEFER_TASKRUN, IORING_SETUP_TASKRUN_FLAG and IORING_SETUP_COOP_TASKRUN as these are somehow slowing down things. Not sure if this thread is optimal to discuss this. I would also first like to sort out all the other design topics before going into fine-tuning... Thanks, Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-14 15:21 ` Bernd Schubert @ 2024-10-14 17:48 ` Pavel Begunkov 2024-10-14 21:27 ` Bernd Schubert 0 siblings, 1 reply; 23+ messages in thread From: Pavel Begunkov @ 2024-10-14 17:48 UTC (permalink / raw) To: Bernd Schubert, Miklos Szeredi, Ming Lei Cc: Jens Axboe, io-uring, Joanne Koong, Josef Bacik On 10/14/24 16:21, Bernd Schubert wrote: > On 10/14/24 15:34, Pavel Begunkov wrote: >> On 10/14/24 13:47, Bernd Schubert wrote: >>> On 10/14/24 13:10, Miklos Szeredi wrote: >>>> On Mon, 14 Oct 2024 at 04:44, Ming Lei <[email protected]> wrote: >>>> >>>>> It also depends on how fuse user code consumes the big CQE payload, if >>>>> fuse header needs to keep in memory a bit long, you may have to copy it >>>>> somewhere for post-processing since io_uring(kernel) needs CQE to be >>>>> returned back asap. >>>> >>>> Yes. >>>> >>>> I'm not quite sure how the libfuse interface will work to accommodate >>>> this. Currently if the server needs to delay the processing of a >>>> request it would have to copy all arguments, since validity will not >>>> be guaranteed after the callback returns. With the io_uring >>>> infrastructure the headers would need to be copied, but the data >>>> buffer would be per-request and would not need copying. This is >>>> relaxing a requirement so existing servers would continue to work >>>> fine, but would not be able to take full advantage of the multi-buffer >>>> design. >>>> >>>> Bernd do you have an idea how this would work? >>> >>> I assume returning a CQE is io_uring_cq_advance()? >> >> Yes >> >>> In my current libfuse io_uring branch that only happens when >>> all CQEs have been processed. We could also easily switch to >>> io_uring_cqe_seen() to do it per CQE. >> >> Either that one. >> >>> I don't understand why we need to return CQEs asap, assuming CQ >>> ring size is the same as SQ ring size - why does it matter? >> >> The SQE is consumed once the request is issued, but nothing >> prevents the user to keep the QD larger than the SQ size, >> e.g. do M syscalls each ending N requests and then wait for typo, Sending or queueing N requests. In other words it's perfectly legal to: It's perfectly legal to: ring = create_ring(nr_cqes=N); for (i = 0 .. M) { for (i = 0..N) prep_sqe(); submit_all_sqes(); } wait(nr=N * M); With a caveat that the wait can't complete more than the CQ size, but you can even add a loop atop of the wait. while (nr_inflight_cqes) { wait(nr = min(CQ_size, nr_inflight_cqes); process_cqes(); } Or do something more elaborate, often frameworks allow to push any number of requests not caring too much about exactly matching queue sizes apart from sizing them for performance reasons. >> N * M completions. >> > > I need a bit help to understand this. Do you mean that in typical > io-uring usage SQEs get submitted, already released in kernel Typical or not, but the number of requests in flight is not limited by the size of the SQ, it only limits how many requests you can queue per syscall, i.e. per io_uring_submit(). > and then users submit even more SQEs? And that creates a > kernel queue depth for completion? > I guess as long as libfuse does not expose the ring we don't have > that issue. But then yeah, exposing the ring to fuse-server/daemon > is planned... Could be, for example you don't need to care about overflows at all if the CQ size is always larger than the number of requests in flight. 
Perhaps the simplest example: prep_requests(nr=N); wait_cq(nr=N); process_cqes(nr=N); >>> If we indeed need to return the CQE before processing the request, >>> it indeed would be better to have a 2nd memory buffer associated with >>> the fuse request. >> >> With that said, the usual problem is to size the CQ so that it >> (almost) never overflows, otherwise it hurts performance. With >> DEFER_TASKRUN you can delay returning CQEs to the kernel until >> the next time you wait for completions, i.e. do io_uring waiting >> syscall. Without the flag, CQEs may come asynchronously to the >> user, so need a bit more consideration. >> > > Current libfuse code has it disabled IORING_SETUP_SINGLE_ISSUER, > IORING_SETUP_DEFER_TASKRUN, IORING_SETUP_TASKRUN_FLAG and > IORING_SETUP_COOP_TASKRUN as these are somehow slowing down > things. Those flags are not a requirement, you can try to size the CQ so that overflows are rare, it's just a bit easier to do with DEFER_TASKRUN. > Not sure if this thread is optimal to discuss this. I would > also first like to sort out all the other design topics before > going into fine-tuning... -- Pavel Begunkov ^ permalink raw reply [flat|nested] 23+ messages in thread
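Spelling the "simplest example" above out with liburing calls might look like this; prep_fetch() and handle() are hypothetical stand-ins for preparing the fuse uring_cmd and dispatching its completion:

static void one_round(struct io_uring *ring, unsigned n)
{
        struct io_uring_cqe *cqe;
        unsigned head, done = 0;

        for (unsigned i = 0; i < n; i++) {
                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

                prep_fetch(sqe, i);                     /* prep_requests(nr=N) */
                sqe->user_data = i;
        }
        io_uring_submit_and_wait(ring, n);              /* wait_cq(nr=N) */

        io_uring_for_each_cqe(ring, head, cqe) {        /* process_cqes(nr=N) */
                handle(cqe);
                done++;
        }
        io_uring_cq_advance(ring, done);
}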
* Re: Large CQE for fuse headers 2024-10-14 17:48 ` Pavel Begunkov @ 2024-10-14 21:27 ` Bernd Schubert 2024-10-16 10:54 ` Miklos Szeredi 0 siblings, 1 reply; 23+ messages in thread From: Bernd Schubert @ 2024-10-14 21:27 UTC (permalink / raw) To: Pavel Begunkov, Miklos Szeredi, Ming Lei Cc: Jens Axboe, io-uring, Joanne Koong, Josef Bacik On 10/14/24 19:48, Pavel Begunkov wrote: > On 10/14/24 16:21, Bernd Schubert wrote: >> On 10/14/24 15:34, Pavel Begunkov wrote: >>> On 10/14/24 13:47, Bernd Schubert wrote: >>>> On 10/14/24 13:10, Miklos Szeredi wrote: >>>>> On Mon, 14 Oct 2024 at 04:44, Ming Lei <[email protected]> wrote: >>>>> >>>>>> It also depends on how fuse user code consumes the big CQE >>>>>> payload, if >>>>>> fuse header needs to keep in memory a bit long, you may have to >>>>>> copy it >>>>>> somewhere for post-processing since io_uring(kernel) needs CQE to be >>>>>> returned back asap. >>>>> >>>>> Yes. >>>>> >>>>> I'm not quite sure how the libfuse interface will work to accommodate >>>>> this. Currently if the server needs to delay the processing of a >>>>> request it would have to copy all arguments, since validity will not >>>>> be guaranteed after the callback returns. With the io_uring >>>>> infrastructure the headers would need to be copied, but the data >>>>> buffer would be per-request and would not need copying. This is >>>>> relaxing a requirement so existing servers would continue to work >>>>> fine, but would not be able to take full advantage of the multi-buffer >>>>> design. >>>>> >>>>> Bernd do you have an idea how this would work? >>>> >>>> I assume returning a CQE is io_uring_cq_advance()? >>> >>> Yes >>> >>>> In my current libfuse io_uring branch that only happens when >>>> all CQEs have been processed. We could also easily switch to >>>> io_uring_cqe_seen() to do it per CQE. >>> >>> Either that one. >>> >>>> I don't understand why we need to return CQEs asap, assuming CQ >>>> ring size is the same as SQ ring size - why does it matter? >>> >>> The SQE is consumed once the request is issued, but nothing >>> prevents the user to keep the QD larger than the SQ size, >>> e.g. do M syscalls each ending N requests and then wait for > > typo, Sending or queueing N requests. In other words it's > perfectly legal to: > > It's perfectly legal to: > > ring = create_ring(nr_cqes=N); > for (i = 0 .. M) { > for (i = 0..N) > prep_sqe(); > submit_all_sqes(); > } > wait(nr=N * M); > > > With a caveat that the wait can't complete more than the > CQ size, but you can even add a loop atop of the wait. > > while (nr_inflight_cqes) { > wait(nr = min(CQ_size, nr_inflight_cqes); > process_cqes(); > } > > Or do something more elaborate, often frameworks allow > to push any number of requests not caring too much about > exactly matching queue sizes apart from sizing them for > performance reasons. > >>> N * M completions. >>> >> >> I need a bit help to understand this. Do you mean that in typical >> io-uring usage SQEs get submitted, already released in kernel > > Typical or not, but the number of requests in flight is not > limited by the size of the SQ, it only limits how many > requests you can queue per syscall, i.e. per io_uring_submit(). > > >> and then users submit even more SQEs? And that creates a >> kernel queue depth for completion? >> I guess as long as libfuse does not expose the ring we don't have >> that issue. But then yeah, exposing the ring to fuse-server/daemon >> is planned... 
> > Could be, for example you don't need to care about overflows > at all if the CQ size is always larger than the number of > requests in flight. Perhaps the simplest example: > > prep_requests(nr=N); > wait_cq(nr=N); > process_cqes(nr=N); With only libfuse as ring user it is more like prep_requests(nr=N); wait_cq(1); ==> we must not wait for more than 1 as more might never arrive io_uring_for_each_cqe { } I still think no issue with libfuse (or any other fuse lib) as single ring-user, but if the same ring then gets used for more all of that might come up. @Miklos maybe we avoid using large CQEs/SQEs and instead set up our own separate buffer for FUSE headers? Thanks, Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-14 21:27 ` Bernd Schubert @ 2024-10-16 10:54 ` Miklos Szeredi 2024-10-16 11:53 ` Bernd Schubert 0 siblings, 1 reply; 23+ messages in thread From: Miklos Szeredi @ 2024-10-16 10:54 UTC (permalink / raw) To: Bernd Schubert Cc: Pavel Begunkov, Ming Lei, Jens Axboe, io-uring, Joanne Koong, Josef Bacik On Mon, 14 Oct 2024 at 23:27, Bernd Schubert <[email protected]> wrote: > With only libfuse as ring user it is more like > > prep_requests(nr=N); > wait_cq(1); ==> we must not wait for more than 1 as more might never arrive > io_uring_for_each_cqe { > } Right. I think the point Pavel is trying to make is that io_uring queue sizes don't have to match fuse queue size. So we could have sq_entries=4, cq_entries=4 and have the server queue 64 FUSE_URING_REQ_FETCH commands, it just has to do that in batches of 4 max. > @Miklos maybe we avoid using large CQEs/SQEs and instead set up our own > separate buffer for FUSE headers? The only gain from this would be in the case where the uring is used for non-fuse requests as well, in which case the extra space in the queue entries would be unused (i.e. 48 unused bytes in the cacheline). I don't know if this is a realistic use case or not. It's definitely a challenge to create a library API that allows this. The disadvantage would be a more complex interface. Thanks, Miklos ^ permalink raw reply [flat|nested] 23+ messages in thread
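The sq_entries=4 example in concrete terms, as a sketch (prep_fetch_cmd() is a hypothetical helper that prepares one FUSE_URING_REQ_FETCH uring_cmd):

/* a 4-entry SQ can still keep 64 fetch commands in flight; it just has to
 * submit them in batches of at most 4 */
static void queue_fetches(struct io_uring *ring, unsigned total, unsigned batch)
{
        for (unsigned done = 0; done < total; done += batch) {
                for (unsigned i = 0; i < batch && done + i < total; i++)
                        prep_fetch_cmd(io_uring_get_sqe(ring), done + i);
                io_uring_submit(ring);
        }
}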
* Re: Large CQE for fuse headers 2024-10-16 10:54 ` Miklos Szeredi @ 2024-10-16 11:53 ` Bernd Schubert 2024-10-16 12:24 ` Miklos Szeredi 2024-10-17 0:59 ` Ming Lei 0 siblings, 2 replies; 23+ messages in thread From: Bernd Schubert @ 2024-10-16 11:53 UTC (permalink / raw) To: Miklos Szeredi Cc: Pavel Begunkov, Ming Lei, Jens Axboe, io-uring, Joanne Koong, Josef Bacik On 10/16/24 12:54, Miklos Szeredi wrote: > On Mon, 14 Oct 2024 at 23:27, Bernd Schubert <[email protected]> wrote: > >> With only libfuse as ring user it is more like >> >> prep_requests(nr=N); >> wait_cq(1); ==> we must not wait for more than 1 as more might never arrive >> io_uring_for_each_cqe { >> } > > Right. > > I think the point Pavel is trying to make is that io_uring queue > sizes don't have to match fuse queue size. So we could have > sq_entries=4, cq_entries=4 and have the server queue 64 > FUSE_URING_REQ_FETCH commands, it just has to do that in batches of 4 > max. Hmm ok, I guess that might matter when payload is small compared to SQ/CQ size and the system is low in memory. > >> @Miklos maybe we avoid using large CQEs/SQEs and instead set up our own >> separate buffer for FUSE headers? > > The only gain from this would be in the case where the uring is used > for non-fuse requests as well, in which case the extra space in the > queue entries would be unused (i.e. 48 unused bytes in the cacheline). > I don't know if this is a realistic use case or not. It's definitely > a challenge to create a library API that allows this. > > The disadvantage would be a more complex interface. I don't think that's complicated. In the end it is just another pointer that needs to be mapped. We don't even need to use mmap. At least for zero-copy we will need to put non-fuse requests on the ring. For the DDN use case, we are using another io-uring for tcp requests, I would actually like to switch that to the same ring. Thanks, Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-16 11:53 ` Bernd Schubert @ 2024-10-16 12:24 ` Miklos Szeredi 2024-10-17 0:59 ` Ming Lei 1 sibling, 0 replies; 23+ messages in thread From: Miklos Szeredi @ 2024-10-16 12:24 UTC (permalink / raw) To: Bernd Schubert Cc: Pavel Begunkov, Ming Lei, Jens Axboe, io-uring, Joanne Koong, Josef Bacik On Wed, 16 Oct 2024 at 13:53, Bernd Schubert <[email protected]> wrote: > I don't think that complicated. In the end it is just another pointer > that needs to be mapped. We don't even need to use mmap. > At least for zero-copy we will need to the ring non-fuse requests. > For the DDN use case, we are using another io-uring for tcp requests, > I would actually like to switch that to the same ring. Okay, let's try and see how that works. Thanks, Miklos ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-16 11:53 ` Bernd Schubert 2024-10-16 12:24 ` Miklos Szeredi @ 2024-10-17 0:59 ` Ming Lei 1 sibling, 0 replies; 23+ messages in thread From: Ming Lei @ 2024-10-17 0:59 UTC (permalink / raw) To: Bernd Schubert Cc: Miklos Szeredi, Pavel Begunkov, Jens Axboe, io-uring, Joanne Koong, Josef Bacik On Wed, Oct 16, 2024 at 01:53:00PM +0200, Bernd Schubert wrote: > > > On 10/16/24 12:54, Miklos Szeredi wrote: > > On Mon, 14 Oct 2024 at 23:27, Bernd Schubert <[email protected]> wrote: > > > >> With only libfuse as ring user it is more like > >> > >> prep_requests(nr=N); > >> wait_cq(1); ==> we must not wait for more than 1 as more might never arrive > >> io_uring_for_each_cqe { > >> } > > > > Right. > > > > I think the point Pavel is trying to make is that io_uring queue > > sizes don't have to match fuse queue size. So we could have > > sq_entries=4, cq_entries=4 and have the server queue 64 > > FUSE_URING_REQ_FETCH commands, it just has to do that in batches of 4 > > max. > > Hmm ok, I guess that might matter when payload is small compared to > SQ/CQ size and the system is low in memory. > > > > >> @Miklos maybe we avoid using large CQEs/SQEs and instead set up our own > >> separate buffer for FUSE headers? > > > > The only gain from this would be in the case where the uring is used > > for non-fuse requests as well, in which case the extra space in the > > queue entries would be unused (i.e. 48 unused bytes in the cacheline). > > I don't know if this is a realistic use case or not. It's definitely > > a challenge to create a library API that allows this. > > > > The disadvantage would be a more complex interface. > > I don't think that complicated. In the end it is just another pointer > that needs to be mapped. We don't even need to use mmap. > At least for zero-copy we will need to the ring non-fuse requests. > For the DDN use case, we are using another io-uring for tcp requests, > I would actually like to switch that to the same ring. I remember the biggest trouble of using same ring in ublk could be exporting the ring for API users, but it is often per-task, seems not too hard to deal with. The pros is you needn't use eventfd to communicate with fuse command uring(thread) any more, and more uring IOs can be handled in single batch. Performance is better, with less task switch involved, without extra communication. Thanks, Ming ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-14 11:10 ` Miklos Szeredi 2024-10-14 12:47 ` Bernd Schubert @ 2024-10-14 13:20 ` Bernd Schubert 1 sibling, 0 replies; 23+ messages in thread From: Bernd Schubert @ 2024-10-14 13:20 UTC (permalink / raw) To: Miklos Szeredi, Ming Lei Cc: Jens Axboe, io-uring, Pavel Begunkov, Joanne Koong, Josef Bacik, Antonio SJ Musumeci On 10/14/24 13:10, Miklos Szeredi wrote: > On Mon, 14 Oct 2024 at 04:44, Ming Lei <[email protected]> wrote: > >> It also depends on how fuse user code consumes the big CQE payload, if >> fuse header needs to keep in memory a bit long, you may have to copy it >> somewhere for post-processing since io_uring(kernel) needs CQE to be >> returned back asap. > > Yes. > > I'm not quite sure how the libfuse interface will work to accommodate > this. Currently if the server needs to delay the processing of a > request it would have to copy all arguments, since validity will not > be guaranteed after the callback returns. With the io_uring Well, it depends on the libfuse implementation. In plain libfuse the buffer is associated with the thread. This could be improved by creating a request pool and buffers per request. AFAIK, Antonio has done that for mergerfs. > infrastructure the headers would need to be copied, but the data > buffer would be per-request and would not need copying. This is > relaxing a requirement so existing servers would continue to work Yep, that is actually how we use it at ddn for requests over io-uring. > fine, but would not be able to take full advantage of the multi-buffer > design. What do you actually mean by "multi-buffer design"? Thanks, Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Large CQE for fuse headers 2024-10-13 21:20 ` Bernd Schubert 2024-10-14 2:44 ` Ming Lei @ 2024-10-14 10:31 ` Miklos Szeredi 1 sibling, 0 replies; 23+ messages in thread From: Miklos Szeredi @ 2024-10-14 10:31 UTC (permalink / raw) To: Bernd Schubert Cc: Jens Axboe, Ming Lei, io-uring, Pavel Begunkov, Joanne Koong, Josef Bacik On Sun, 13 Oct 2024 at 23:20, Bernd Schubert <[email protected]> wrote: > > > > On 10/12/24 16:38, Jens Axboe wrote: > > That may indeed be a decent idea for this too. You don't even need fancy > > tagging, you can just use the cqe index for your tag too, as it should > > not be bigger than the the cq ring space. Then you can get away with > > just using normal cqe sizes, and just have a shared region between the > > two where data gets written by the uring_cmd completion, and the app can > > access it directly from userspace. > > Would be good if Miklos could chime in here, adding back mmap for headers > wouldn't be difficult, but would add back more fuse-uring startup and > tear-down code. My worry is making the API more complex, OTOH I understand the need for io_uring to refrain from adding fuse specific features. Also seems like io_uring is accounting some of the pinned memory, but for the queues themselves it does not do that, even though the max number of sqes (32k) can take a substantial amount of memory. Growing the cqe would make this worse, but this could be fixed by adding the missing accounting, possibly only if using non-standard cqe sizes to avoid breaking backward compatibility. Thanks, Miklos ^ permalink raw reply [flat|nested] 23+ messages in thread