* Bug? CQE.res = -EAGAIN with nvme multipath driver
@ 2025-01-06 20:03 Haeuptle, Michael
2025-01-06 23:53 ` Jens Axboe
0 siblings, 1 reply; 9+ messages in thread
From: Haeuptle, Michael @ 2025-01-06 20:03 UTC (permalink / raw)
To: [email protected]
Hello,
I’m using the nvme multipath driver (NVMF/RDMA) and io-uring. When a path goes away, I sometimes get a CQE.res = -EAGAIN in user space.
This is unexpected since the nvme multipath driver should handle this transparently. It’s somewhat workload related but easy to reproduce with fio.
The multipath driver uses a kblockd worker to re-queue the failed NVMe bios (https://github.com/torvalds/linux/blob/13563da6ffcf49b8b45772e40b35f96926a7ee1e/drivers/nvme/host/multipath.c#L126).
The original request is ended.
When the nvme_requeue_work callback is executed, the blk layer tries to allocate a new request for the bio, but that fails and the bio status is set to BLK_STS_AGAIN (https://elixir.bootlin.com/linux/v6.12.6/source/block/blk-mq.c#L2987).
The failure to allocate a new req seems to be due to all tags for the queue being used up.
Eventually, this makes it into io_uring’s io_rw_should_reissue and hits same_thread_group(req->tctx->task, current) = false (in https://github.com/torvalds/linux/blob/13563da6ffcf49b8b45772e40b35f96926a7ee1e/io_uring/rw.c#L437). As a result, CQE.res = -EAGAIN is thrown back to the user space program.
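For reference, here's an abridged sketch of the check in question, paraphrased from io_rw_should_reissue() in io_uring/rw.c of the tree linked above (most of the surrounding checks are elided, so treat it as illustrative rather than the exact code):

static bool io_rw_should_reissue(struct io_kiocb *req)
{
	/* ... file type, NOWAIT and ctx liveness checks elided ... */

	/*
	 * When the -EAGAIN completion runs from a kworker (as in the
	 * nvme_requeue_work path above), "current" is the worker kthread
	 * rather than the submitting task, so this test fails and the
	 * error is surfaced to user space instead of being reissued.
	 */
	if (!same_thread_group(req->tctx->task, current) || !in_task())
		return false;

	return true;
}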
Here’s a stack dump when we hit same_thread_group(req->tctx->task, current) = false
kernel: [237700.098733] dump_stack_lvl+0x44/0x5c
kernel: [237700.098737] io_rw_should_reissue.cold+0x5d/0x64
kernel: [237700.098742] io_complete_rw+0x9a/0xc0
kernel: [237700.098745] blkdev_bio_end_io_async+0x33/0x80
kernel: [237700.098749] blk_mq_submit_bio+0x5b5/0x620
kernel: [237700.098756] submit_bio_noacct_nocheck+0x163/0x370
kernel: [237700.098760] ? submit_bio_noacct+0x79/0x4b0
kernel: [237700.098764] nvme_requeue_work+0x4b/0x60 [nvme_core]
kernel: [237700.098776] process_one_work+0x1c7/0x380
kernel: [237700.098782] worker_thread+0x4d/0x380
kernel: [237700.098786] ? _raw_spin_lock_irqsave+0x23/0x50
kernel: [237700.098791] ? rescuer_thread+0x3a0/0x3a0
kernel: [237700.098794] kthread+0xe9/0x110
kernel: [237700.098798] ? kthread_complete_and_exit+0x20/0x20
kernel: [237700.098802] ret_from_fork+0x22/0x30
kernel: [237700.098811] </TASK>
Is the same_thread_group() check really needed in this case? The thread groups are certainly different… Any side effects if this check is being removed?
Thanks.
Michael
* Re: Bug? CQE.res = -EAGAIN with nvme multipath driver
2025-01-06 20:03 Bug? CQE.res = -EAGAIN with nvme multipath driver Haeuptle, Michael
@ 2025-01-06 23:53 ` Jens Axboe
2025-01-07 2:33 ` Jens Axboe
0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2025-01-06 23:53 UTC (permalink / raw)
To: Haeuptle, Michael, [email protected]
On 1/6/25 1:03 PM, Haeuptle, Michael wrote:
> Hello,
>
> I’m using the nvme multipath driver (NVMF/RDMA) and io-uring. When a
> path goes away, I sometimes get a CQE.res = -EAGAIN in user space.
> This is unexpected since the nvme multipath driver should handle this
> transparently. It’s somewhat workload related but easy to reproduce
> with fio.
>
> The multipath driver uses kblockd worker to re-queue the failed NVME
> bios
> (https://github.com/torvalds/linux/blob/13563da6ffcf49b8b45772e40b35f96926a7ee1e/drivers/nvme/host/multipath.c#L126).
> The original request is ended.
>
> When the nvme_requeue_work callback is executed, the blk layer tries
> to allocate a new request for the bios but that fails and the bio
> status is set to BLK_STS_AGAIN
> (https://elixir.bootlin.com/linux/v6.12.6/source/block/blk-mq.c#L2987).
> The failure to allocate a new req seems to be due to all tags for the
> queue being used up.
>
> Eventually, this makes it into io_uring’s io_rw_should_reissue and
> hits same_thread_group(req->tctx->task, current) = false (in
> https://github.com/torvalds/linux/blob/13563da6ffcf49b8b45772e40b35f96926a7ee1e/io_uring/rw.c#L437).
> As a result, CQE.res = -EAGAIN and thrown back to the user space
> program.
>
> Here’s a stack dump when we hit same_thread_group(req->tctx->task,
> current) = false
>
> kernel: [237700.098733] dump_stack_lvl+0x44/0x5c
> kernel: [237700.098737] io_rw_should_reissue.cold+0x5d/0x64
> kernel: [237700.098742] io_complete_rw+0x9a/0xc0
> kernel: [237700.098745] blkdev_bio_end_io_async+0x33/0x80
> kernel: [237700.098749] blk_mq_submit_bio+0x5b5/0x620
> kernel: [237700.098756] submit_bio_noacct_nocheck+0x163/0x370
> kernel: [237700.098760] ? submit_bio_noacct+0x79/0x4b0
> kernel: [237700.098764] nvme_requeue_work+0x4b/0x60 [nvme_core]
> kernel: [237700.098776] process_one_work+0x1c7/0x380
> kernel: [237700.098782] worker_thread+0x4d/0x380
> kernel: [237700.098786] ? _raw_spin_lock_irqsave+0x23/0x50
> kernel: [237700.098791] ? rescuer_thread+0x3a0/0x3a0
> kernel: [237700.098794] kthread+0xe9/0x110
> kernel: [237700.098798] ? kthread_complete_and_exit+0x20/0x20
> kernel: [237700.098802] ret_from_fork+0x22/0x30
> kernel: [237700.098811] </TASK>
>
> Is the same_thread_group() check really needed in this case? The
> thread groups are certainly different… Any side effects if this check
> is being removed?
It's there for safety reasons - across all request types, it's not
always safe. For this case, absolutely the check does not need to be
there. So probably best to ponder ways to bypass it selectively.
Let me ponder a bit what the best approach would be here...
--
Jens Axboe
* Re: Bug? CQE.res = -EAGAIN with nvme multipath driver
2025-01-06 23:53 ` Jens Axboe
@ 2025-01-07 2:33 ` Jens Axboe
2025-01-07 2:39 ` Jens Axboe
0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2025-01-07 2:33 UTC (permalink / raw)
To: Haeuptle, Michael, [email protected]
On 1/6/25 4:53 PM, Jens Axboe wrote:
> On 1/6/25 1:03 PM, Haeuptle, Michael wrote:
>> Hello,
>>
>> I’m using the nvme multipath driver (NVMF/RDMA) and io-uring. When a
>> path goes away, I sometimes get a CQE.res = -EAGAIN in user space.
>> This is unexpected since the nvme multipath driver should handle this
>> transparently. It’s somewhat workload related but easy to reproduce
>> with fio.
>>
>> The multipath driver uses kblockd worker to re-queue the failed NVME
>> bios
>> (https://github.com/torvalds/linux/blob/13563da6ffcf49b8b45772e40b35f96926a7ee1e/drivers/nvme/host/multipath.c#L126).
>> The original request is ended.
>>
>> When the nvme_requeue_work callback is executed, the blk layer tries
>> to allocate a new request for the bios but that fails and the bio
>> status is set to BLK_STS_AGAIN
>> (https://elixir.bootlin.com/linux/v6.12.6/source/block/blk-mq.c#L2987).
>> The failure to allocate a new req seems to be due to all tags for the
>> queue being used up.
>>
>> Eventually, this makes it into io_uring’s io_rw_should_reissue and
>> hits same_thread_group(req->tctx->task, current) = false (in
>> https://github.com/torvalds/linux/blob/13563da6ffcf49b8b45772e40b35f96926a7ee1e/io_uring/rw.c#L437).
>> As a result, CQE.res = -EAGAIN and thrown back to the user space
>> program.
>>
>> Here’s a stack dump when we hit same_thread_group(req->tctx->task,
>> current) = false
>>
>> kernel: [237700.098733] dump_stack_lvl+0x44/0x5c
>> kernel: [237700.098737] io_rw_should_reissue.cold+0x5d/0x64
>> kernel: [237700.098742] io_complete_rw+0x9a/0xc0
>> kernel: [237700.098745] blkdev_bio_end_io_async+0x33/0x80
>> kernel: [237700.098749] blk_mq_submit_bio+0x5b5/0x620
>> kernel: [237700.098756] submit_bio_noacct_nocheck+0x163/0x370
>> kernel: [237700.098760] ? submit_bio_noacct+0x79/0x4b0
>> kernel: [237700.098764] nvme_requeue_work+0x4b/0x60 [nvme_core]
>> kernel: [237700.098776] process_one_work+0x1c7/0x380
>> kernel: [237700.098782] worker_thread+0x4d/0x380
>> kernel: [237700.098786] ? _raw_spin_lock_irqsave+0x23/0x50
>> kernel: [237700.098791] ? rescuer_thread+0x3a0/0x3a0
>> kernel: [237700.098794] kthread+0xe9/0x110
>> kernel: [237700.098798] ? kthread_complete_and_exit+0x20/0x20
>> kernel: [237700.098802] ret_from_fork+0x22/0x30
>> kernel: [237700.098811] </TASK>
>>
>> Is the same_thread_group() check really needed in this case? The
>> thread groups are certainly different… Any side effects if this check
>> is being removed?
>
> It's there for safety reasons - across all request types, it's not
> always safe. For this case, absolutely the check does not need to be
> there. So probably best to ponder ways to bypass it selectively.
>
> Let me ponder a bit what the best approach would be here...
Actually I think we can just remove it. The actual retry will happen out
of context anyway, and the comment about the import is no longer valid
as the import will have been done upfront since 6.10.
Do you want to send a patch for that, or do you want me to send one out
referencing this report?
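Taking the snippet from your first mail as context, it would boil down to something like this - totally untested, the hunk context is approximate, and not necessarily the exact change that would get queued:

--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ static bool io_rw_should_reissue(struct io_kiocb *req)
 	/* ... file type, NOWAIT and ctx liveness checks ... */
 
-	if (!same_thread_group(req->tctx->task, current) || !in_task())
-		return false;
-
 	return true;
 }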
--
Jens Axboe
* Re: Bug? CQE.res = -EAGAIN with nvme multipath driver
2025-01-07 2:33 ` Jens Axboe
@ 2025-01-07 2:39 ` Jens Axboe
2025-01-07 18:12 ` Jens Axboe
0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2025-01-07 2:39 UTC (permalink / raw)
To: Haeuptle, Michael, [email protected]
On 1/6/25 7:33 PM, Jens Axboe wrote:
> On 1/6/25 4:53 PM, Jens Axboe wrote:
>> On 1/6/25 1:03 PM, Haeuptle, Michael wrote:
>>> Hello,
>>>
>>> I’m using the nvme multipath driver (NVMF/RDMA) and io-uring. When a
>>> path goes away, I sometimes get a CQE.res = -EAGAIN in user space.
>>> This is unexpected since the nvme multipath driver should handle this
>>> transparently. It’s somewhat workload related but easy to reproduce
>>> with fio.
>>>
>>> The multipath driver uses kblockd worker to re-queue the failed NVME
>>> bios
>>> (https://github.com/torvalds/linux/blob/13563da6ffcf49b8b45772e40b35f96926a7ee1e/drivers/nvme/host/multipath.c#L126).
>>> The original request is ended.
>>>
>>> When the nvme_requeue_work callback is executed, the blk layer tries
>>> to allocate a new request for the bios but that fails and the bio
>>> status is set to BLK_STS_AGAIN
>>> (https://elixir.bootlin.com/linux/v6.12.6/source/block/blk-mq.c#L2987).
>>> The failure to allocate a new req seems to be due to all tags for the
>>> queue being used up.
>>>
>>> Eventually, this makes it into io_uring’s io_rw_should_reissue and
>>> hits same_thread_group(req->tctx->task, current) = false (in
>>> https://github.com/torvalds/linux/blob/13563da6ffcf49b8b45772e40b35f96926a7ee1e/io_uring/rw.c#L437).
>>> As a result, CQE.res = -EAGAIN and thrown back to the user space
>>> program.
>>>
>>> Here’s a stack dump when we hit same_thread_group(req->tctx->task,
>>> current) = false
>>>
>>> kernel: [237700.098733] dump_stack_lvl+0x44/0x5c
>>> kernel: [237700.098737] io_rw_should_reissue.cold+0x5d/0x64
>>> kernel: [237700.098742] io_complete_rw+0x9a/0xc0
>>> kernel: [237700.098745] blkdev_bio_end_io_async+0x33/0x80
>>> kernel: [237700.098749] blk_mq_submit_bio+0x5b5/0x620
>>> kernel: [237700.098756] submit_bio_noacct_nocheck+0x163/0x370
>>> kernel: [237700.098760] ? submit_bio_noacct+0x79/0x4b0
>>> kernel: [237700.098764] nvme_requeue_work+0x4b/0x60 [nvme_core]
>>> kernel: [237700.098776] process_one_work+0x1c7/0x380
>>> kernel: [237700.098782] worker_thread+0x4d/0x380
>>> kernel: [237700.098786] ? _raw_spin_lock_irqsave+0x23/0x50
>>> kernel: [237700.098791] ? rescuer_thread+0x3a0/0x3a0
>>> kernel: [237700.098794] kthread+0xe9/0x110
>>> kernel: [237700.098798] ? kthread_complete_and_exit+0x20/0x20
>>> kernel: [237700.098802] ret_from_fork+0x22/0x30
>>> kernel: [237700.098811] </TASK>
>>>
>>> Is the same_thread_group() check really needed in this case? The
>>> thread groups are certainly different… Any side effects if this check
>>> is being removed?
>>
>> It's there for safety reasons - across all request types, it's not
>> always safe. For this case, absolutely the check does not need to be
>> there. So probably best to ponder ways to bypass it selectively.
>>
>> Let me ponder a bit what the best approach would be here...
>
> Actually I think we can just remove it. The actual retry will happen out
> of context anyway, and the comment about the import is no longer valid
> as the import will have been done upfront since 6.10.
>
> Do you want to send a patch for that, or do you want me to send one out
> referencing this report?
Also see:
commit 039a2e800bcd5beb89909d1a488abf3d647642cf
Author: Jens Axboe <[email protected]>
Date: Thu Apr 25 09:04:32 2024 -0600
io_uring/rw: reinstate thread check for retries
let me take a closer look tomorrow...
--
Jens Axboe
* Re: Bug? CQE.res = -EAGAIN with nvme multipath driver
2025-01-07 2:39 ` Jens Axboe
@ 2025-01-07 18:12 ` Jens Axboe
2025-01-07 18:24 ` Haeuptle, Michael
0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2025-01-07 18:12 UTC (permalink / raw)
To: Haeuptle, Michael, [email protected]
On 1/6/25 7:39 PM, Jens Axboe wrote:
> On 1/6/25 7:33 PM, Jens Axboe wrote:
>> On 1/6/25 4:53 PM, Jens Axboe wrote:
>>> On 1/6/25 1:03 PM, Haeuptle, Michael wrote:
>>>> Hello,
>>>>
>>>> I’m using the nvme multipath driver (NVMF/RDMA) and io-uring. When a
>>>> path goes away, I sometimes get a CQE.res = -EAGAIN in user space.
>>>> This is unexpected since the nvme multipath driver should handle this
>>>> transparently. It’s somewhat workload related but easy to reproduce
>>>> with fio.
>>>>
>>>> The multipath driver uses kblockd worker to re-queue the failed NVME
>>>> bios
>>>> (https://github.com/torvalds/linux/blob/13563da6ffcf49b8b45772e40b35f96926a7ee1e/drivers/nvme/host/multipath.c#L126).
>>>> The original request is ended.
>>>>
>>>> When the nvme_requeue_work callback is executed, the blk layer tries
>>>> to allocate a new request for the bios but that fails and the bio
>>>> status is set to BLK_STS_AGAIN
>>>> (https://elixir.bootlin.com/linux/v6.12.6/source/block/blk-mq.c#L2987).
>>>> The failure to allocate a new req seems to be due to all tags for the
>>>> queue being used up.
>>>>
>>>> Eventually, this makes it into io_uring’s io_rw_should_reissue and
>>>> hits same_thread_group(req->tctx->task, current) = false (in
>>>> https://github.com/torvalds/linux/blob/13563da6ffcf49b8b45772e40b35f96926a7ee1e/io_uring/rw.c#L437).
>>>> As a result, CQE.res = -EAGAIN and thrown back to the user space
>>>> program.
>>>>
>>>> Here’s a stack dump when we hit same_thread_group(req->tctx->task,
>>>> current) = false
>>>>
>>>> kernel: [237700.098733] dump_stack_lvl+0x44/0x5c
>>>> kernel: [237700.098737] io_rw_should_reissue.cold+0x5d/0x64
>>>> kernel: [237700.098742] io_complete_rw+0x9a/0xc0
>>>> kernel: [237700.098745] blkdev_bio_end_io_async+0x33/0x80
>>>> kernel: [237700.098749] blk_mq_submit_bio+0x5b5/0x620
>>>> kernel: [237700.098756] submit_bio_noacct_nocheck+0x163/0x370
>>>> kernel: [237700.098760] ? submit_bio_noacct+0x79/0x4b0
>>>> kernel: [237700.098764] nvme_requeue_work+0x4b/0x60 [nvme_core]
>>>> kernel: [237700.098776] process_one_work+0x1c7/0x380
>>>> kernel: [237700.098782] worker_thread+0x4d/0x380
>>>> kernel: [237700.098786] ? _raw_spin_lock_irqsave+0x23/0x50
>>>> kernel: [237700.098791] ? rescuer_thread+0x3a0/0x3a0
>>>> kernel: [237700.098794] kthread+0xe9/0x110
>>>> kernel: [237700.098798] ? kthread_complete_and_exit+0x20/0x20
>>>> kernel: [237700.098802] ret_from_fork+0x22/0x30
>>>> kernel: [237700.098811] </TASK>
>>>>
>>>> Is the same_thread_group() check really needed in this case? The
>>>> thread groups are certainly different… Any side effects if this check
>>>> is being removed?
>>>
>>> It's there for safety reasons - across all request types, it's not
>>> always safe. For this case, absolutely the check does not need to be
>>> there. So probably best to ponder ways to bypass it selectively.
>>>
>>> Let me ponder a bit what the best approach would be here...
>>
>> Actually I think we can just remove it. The actual retry will happen out
>> of context anyway, and the comment about the import is no longer valid
>> as the import will have been done upfront since 6.10.
>>
>> Do you want to send a patch for that, or do you want me to send one out
>> referencing this report?
>
> Also see:
>
> commit 039a2e800bcd5beb89909d1a488abf3d647642cf
> Author: Jens Axboe <[email protected]>
> Date: Thu Apr 25 09:04:32 2024 -0600
>
> io_uring/rw: reinstate thread check for retries
>
> let me take a closer look tomorrow...
If you can test a custom kernel, can you give this branch a try?
git://git.kernel.dk/linux.git io_uring-rw-retry
--
Jens Axboe
* RE: Bug? CQE.res = -EAGAIN with nvme multipath driver
2025-01-07 18:12 ` Jens Axboe
@ 2025-01-07 18:24 ` Haeuptle, Michael
2025-01-07 18:26 ` Jens Axboe
0 siblings, 1 reply; 9+ messages in thread
From: Haeuptle, Michael @ 2025-01-07 18:24 UTC (permalink / raw)
To: Jens Axboe, [email protected]
Thanks for the quick response!
When I remove that check on the 6.1.85 kernel we're using, it seems that the user space program is losing IOs.
I confirmed this with fio. When we hit this issue, fio never completes and is stuck.
I can certainly try that later kernel with your fix, if you think there are other changes that prevent losing IOs.
-- Michael
-----Original Message-----
From: Jens Axboe <[email protected]>
Sent: Tuesday, January 7, 2025 11:13 AM
To: Haeuptle, Michael <[email protected]>; [email protected]
Subject: Re: Bug? CQE.res = -EAGAIN with nvme multipath driver
On 1/6/25 7:39 PM, Jens Axboe wrote:
> On 1/6/25 7:33 PM, Jens Axboe wrote:
>> On 1/6/25 4:53 PM, Jens Axboe wrote:
>>> On 1/6/25 1:03 PM, Haeuptle, Michael wrote:
>>>> Hello,
>>>>
>>>> I’m using the nvme multipath driver (NVMF/RDMA) and io-uring. When
>>>> a path goes away, I sometimes get a CQE.res = -EAGAIN in user space.
>>>> This is unexpected since the nvme multipath driver should handle
>>>> this transparently. It’s somewhat workload related but easy to
>>>> reproduce with fio.
>>>>
>>>> The multipath driver uses kblockd worker to re-queue the failed
>>>> NVME bios
>>>> (https://github.com/torvalds/linux/blob/13563da6ffcf49b8b45772e40b35f96926a7ee1e/drivers/nvme/host/multipath.c#L126).
>>>> The original request is ended.
>>>>
>>>> When the nvme_requeue_work callback is executed, the blk layer
>>>> tries to allocate a new request for the bios but that fails and the
>>>> bio status is set to BLK_STS_AGAIN
>>>> (https://elixir.bootlin.com/linux/v6.12.6/source/block/blk-mq.c#L2987).
>>>> The failure to allocate a new req seems to be due to all tags for
>>>> the queue being used up.
>>>>
>>>> Eventually, this makes it into io_uring’s io_rw_should_reissue and
>>>> hits same_thread_group(req->tctx->task, current) = false (in
>>>> https://github.com/torvalds/linux/blob/13563da6ffcf49b8b45772e40b35f96926a7ee1e/io_uring/rw.c#L437).
>>>> As a result, CQE.res = -EAGAIN and thrown back to the user space
>>>> program.
>>>>
>>>> Here’s a stack dump when we hit same_thread_group(req->tctx->task,
>>>> current) = false
>>>>
>>>> kernel: [237700.098733] dump_stack_lvl+0x44/0x5c
>>>> kernel: [237700.098737] io_rw_should_reissue.cold+0x5d/0x64
>>>> kernel: [237700.098742] io_complete_rw+0x9a/0xc0
>>>> kernel: [237700.098745] blkdev_bio_end_io_async+0x33/0x80
>>>> kernel: [237700.098749] blk_mq_submit_bio+0x5b5/0x620
>>>> kernel: [237700.098756] submit_bio_noacct_nocheck+0x163/0x370
>>>> kernel: [237700.098760] ? submit_bio_noacct+0x79/0x4b0
>>>> kernel: [237700.098764] nvme_requeue_work+0x4b/0x60 [nvme_core]
>>>> kernel: [237700.098776] process_one_work+0x1c7/0x380
>>>> kernel: [237700.098782] worker_thread+0x4d/0x380
>>>> kernel: [237700.098786] ? _raw_spin_lock_irqsave+0x23/0x50
>>>> kernel: [237700.098791] ? rescuer_thread+0x3a0/0x3a0
>>>> kernel: [237700.098794] kthread+0xe9/0x110
>>>> kernel: [237700.098798] ? kthread_complete_and_exit+0x20/0x20
>>>> kernel: [237700.098802] ret_from_fork+0x22/0x30
>>>> kernel: [237700.098811] </TASK>
>>>>
>>>> Is the same_thread_group() check really needed in this case? The
>>>> thread groups are certainly different… Any side effects if this
>>>> check is being removed?
>>>
>>> It's there for safety reasons - across all request types, it's not
>>> always safe. For this case, absolutely the check does not need to be
>>> there. So probably best to ponder ways to bypass it selectively.
>>>
>>> Let me ponder a bit what the best approach would be here...
>>
>> Actually I think we can just remove it. The actual retry will happen
>> out of context anyway, and the comment about the import is no longer
>> valid as the import will have been done upfront since 6.10.
>>
>> Do you want to send a patch for that, or do you want me to send one
>> out referencing this report?
>
> Also see:
>
> commit 039a2e800bcd5beb89909d1a488abf3d647642cf
> Author: Jens Axboe <[email protected]>
> Date: Thu Apr 25 09:04:32 2024 -0600
>
> io_uring/rw: reinstate thread check for retries
>
> let me take a closer look tomorrow...
If you can test a custom kernel, can you give this branch a try?
git://git.kernel.dk/linux.git io_uring-rw-retry
--
Jens Axboe
* Re: Bug? CQE.res = -EAGAIN with nvme multipath driver
2025-01-07 18:24 ` Haeuptle, Michael
@ 2025-01-07 18:26 ` Jens Axboe
2025-01-09 18:09 ` Haeuptle, Michael
0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2025-01-07 18:26 UTC (permalink / raw)
To: Haeuptle, Michael, [email protected]
On 1/7/25 11:24 AM, Haeuptle, Michael wrote:
> Thanks for the quick response!
>
> When I remove that check on the 6.1.85 kernel version we're using,
> then it seems that the user space program is losing IOs. I confirmed
> this with fio. When we hit this issue, fio never completes and is
> stuck.
That's because the io_uring logic assumes it happens inline via
submission, and for your case it does not - which is also why it gets
failed. Hence setting the retry flag in that condition will do
absolutely nothing, as nobody is there to see it.
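Roughly, the flow on your kernel looks like this - heavily simplified
pseudo-code with made-up helper names, just to illustrate the ordering,
not the actual io_uring functions:

/* completion side - can run from the kblockd worker in your trace */
static void complete_rw(struct io_kiocb *req, long res)
{
	if (res == -EAGAIN && io_rw_should_reissue(req)) {
		/* only records the intent to retry, doesn't retry itself */
		req->flags |= REQ_F_REISSUE;
		return;
	}
	/* otherwise the -EAGAIN is posted as the CQE result */
	post_cqe(req, res);
}

/* submission side - the only place that looks at the reissue flag */
static int issue_rw(struct io_kiocb *req)
{
	int ret = do_read_write(req);	/* may complete inline or async */

	/*
	 * This only catches completions that happened inline, while we're
	 * still here.  A completion arriving later from a worker thread
	 * sets the flag after this function has returned, so nothing ever
	 * acts on it and the request appears lost - which is the fio hang
	 * you're seeing.
	 */
	if (req->flags & REQ_F_REISSUE)
		return punt_to_iowq(req);
	return ret;
}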
> I can certainly try that later kernel with your fix, if you think
> there are other changes that prevent losing IOs.
Please try the branch and see how it fares for you.
--
Jens Axboe
* RE: Bug? CQE.res = -EAGAIN with nvme multipath driver
2025-01-07 18:26 ` Jens Axboe
@ 2025-01-09 18:09 ` Haeuptle, Michael
2025-01-10 14:53 ` Jens Axboe
0 siblings, 1 reply; 9+ messages in thread
From: Haeuptle, Michael @ 2025-01-09 18:09 UTC (permalink / raw)
To: Jens Axboe, [email protected]
Hey Jens, sorry for the late response.
I was unable to reproduce the issue with your branch. However, I didn't even hit the spot where the same_thread_group check was removed.
We backported your changes to 6.1.119 and confirmed that our original issue is fixed with your patches.
It seems to me that io_uring performance increased quite a bit in the latest kernel, judging from the fio queue utilization of my workload. Maybe that's why I'm not hitting the place where the same_thread_group check was removed.
Your patch didn't cause any regressions after a day of testing in my NVMF/RDMA & multipath setup. So I think it would be good to get this patch into mainline.
-- Michael
-----Original Message-----
From: Jens Axboe <[email protected]>
Sent: Tuesday, January 7, 2025 11:27 AM
To: Haeuptle, Michael <[email protected]>; [email protected]
Subject: Re: Bug? CQE.res = -EAGAIN with nvme multipath driver
On 1/7/25 11:24 AM, Haeuptle, Michael wrote:
> Thanks for the quick response!
>
> When I remove that check on the 6.1.85 kernel version we're using,
> then it seems that the user space program is losing IOs. I confirmed
> this with fio. When we hit this issue, fio never completes and is
> stuck.
That's because the io_uring logic assumes it happens inline via submission, and for your case it does not. Which is also why it gets failed. And hence setting the retry flag in that condition will do absolutely nothing, as nobody is there to see it.
> I can certainly try that later kernel with your fix, if you think
> there are other changes that prevent losing IOs.
Please try the branch and see how it fares for you.
--
Jens Axboe
* Re: Bug? CQE.res = -EAGAIN with nvme multipath driver
2025-01-09 18:09 ` Haeuptle, Michael
@ 2025-01-10 14:53 ` Jens Axboe
0 siblings, 0 replies; 9+ messages in thread
From: Jens Axboe @ 2025-01-10 14:53 UTC (permalink / raw)
To: Haeuptle, Michael, [email protected]
On 1/9/25 11:09 AM, Haeuptle, Michael wrote:
> Hey Jens, sorry for the late response.
>
> I was unable to reproduce the issue with your branch. However, I
> didn't even hit the spot where same_thread_group check was removed.
Might be a driver side issue too on the nvme front, if it no longer hits
the retry path as much.
> We backported your changes to 6.1.119 and we did see that our original
> issue is fixed with your patches.
>
> It seems to me that io_uring performance increased quite a bit in the
> latest kernel, judging from fio queue utilization of my workload.
> Maybe that's why I'm not hitting the place where same_thread_group was
> removed.
We do try and improve performance all the time, but most likely this is
caused by the same effect that reduces the reissue attempts as well.
> Your patch didn't cause any regression after 1d testing in my
> NVMF/RDMA & multipath setup. So, I think it would be good to get this
> patch on main.
Queued up, thanks for testing.
--
Jens Axboe
Thread overview: 9+ messages
2025-01-06 20:03 Bug? CQE.res = -EAGAIN with nvme multipath driver Haeuptle, Michael
2025-01-06 23:53 ` Jens Axboe
2025-01-07 2:33 ` Jens Axboe
2025-01-07 2:39 ` Jens Axboe
2025-01-07 18:12 ` Jens Axboe
2025-01-07 18:24 ` Haeuptle, Michael
2025-01-07 18:26 ` Jens Axboe
2025-01-09 18:09 ` Haeuptle, Michael
2025-01-10 14:53 ` Jens Axboe