public inbox for [email protected]
From: "Haeuptle, Michael" <[email protected]>
To: Jens Axboe <[email protected]>,
	"[email protected]" <[email protected]>
Subject: RE: Bug? CQE.res = -EAGAIN with nvme multipath driver
Date: Tue, 7 Jan 2025 18:24:33 +0000	[thread overview]
Message-ID: <IA1PR84MB310838E47FDAAFD543B8239A95112@IA1PR84MB3108.NAMPRD84.PROD.OUTLOOK.COM> (raw)
In-Reply-To: <[email protected]>

Thanks for the quick response!

When I remove that check on the 6.1.85 kernel we're using, the user space program appears to lose IOs.
I confirmed this with fio: when we hit the issue, fio never completes and is just stuck.
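
To be concrete about what "losing IOs" means on our side: with the check in place we get a completion with cqe->res == -EAGAIN, but with the check removed (on 6.1.85) no completion ever arrives and anything blocked on the CQ just hangs, which is what the stuck fio run looks like. A minimal liburing sketch of that path (device path, queue depth and sizes are just placeholders, not our real setup):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	void *buf;
	int fd, ret;

	/* hypothetical multipath namespace node */
	fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* O_DIRECT wants an aligned buffer */
	if (posix_memalign(&buf, 4096, 4096))
		return 1;

	if (io_uring_queue_init(32, &ring, 0) < 0)
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, 4096, 0);
	io_uring_submit(&ring);

	/* if the completion is lost, this blocks forever (the fio hang) */
	ret = io_uring_wait_cqe(&ring, &cqe);
	if (ret < 0) {
		fprintf(stderr, "wait_cqe: %s\n", strerror(-ret));
		return 1;
	}

	/* with the check in place, this is where -EAGAIN shows up */
	if (cqe->res == -EAGAIN)
		fprintf(stderr, "unexpected -EAGAIN from multipath device\n");
	else if (cqe->res < 0)
		fprintf(stderr, "read failed: %s\n", strerror(-cqe->res));

	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	close(fd);
	free(buf);
	return 0;
}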

I can certainly try that later kernel with your fix, if you think there are other changes that prevent losing IOs.
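
For reference while testing, my understanding of the check in question, paraphrased from io_uring/rw.c (condensed, not verbatim kernel source):

/*
 * io_rw_should_reissue(), condensed: the reissue is skipped when the
 * completion does not run in the submitting task's thread group, which
 * is exactly the situation when nvme_requeue_work() resubmits the bio
 * from a kblockd worker and request allocation fails with BLK_STS_AGAIN.
 */
static bool io_rw_should_reissue(struct io_kiocb *req)
{
	/* ... file type and NOWAIT checks elided ... */

	/* the kworker completing the request is not in the submitter's group */
	if (!same_thread_group(req->tctx->task, current))
		return false;	/* -> -EAGAIN is posted in CQE.res instead */

	/* ... state is restored here so the retry can be issued ... */
	return true;
}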

-- Michael

-----Original Message-----
From: Jens Axboe <[email protected]> 
Sent: Tuesday, January 7, 2025 11:13 AM
To: Haeuptle, Michael <[email protected]>; [email protected]
Subject: Re: Bug? CQE.res = -EAGAIN with nvme multipath driver

On 1/6/25 7:39 PM, Jens Axboe wrote:
> On 1/6/25 7:33 PM, Jens Axboe wrote:
>> On 1/6/25 4:53 PM, Jens Axboe wrote:
>>> On 1/6/25 1:03 PM, Haeuptle, Michael wrote:
>>>> Hello,
>>>>
>>>> I'm using the nvme multipath driver (NVMF/RDMA) and io_uring. When 
>>>> a path goes away, I sometimes get a CQE.res = -EAGAIN in user space.
>>>> This is unexpected since the nvme multipath driver should handle 
>>>> this transparently. It's somewhat workload related but easy to 
>>>> reproduce with fio.
>>>>
>>>> The multipath driver uses a kblockd worker to re-queue the failed 
>>>> NVMe bios 
>>>> (https://github.com/torvalds/linux/blob/13563da6ffcf49b8b45772e40b35f96926a7ee1e/drivers/nvme/host/multipath.c#L126).
>>>> The original request is ended. 
>>>>
>>>> When the nvme_requeue_work callback is executed, the blk layer 
>>>> tries to allocate a new request for the bios but that fails and the 
>>>> bio status is set to BLK_STS_AGAIN 
>>>> (https://elixir.bootlin.com/linux/v6.12.6/source/block/blk-mq.c#L2987).
>>>> The failure to allocate a new req seems to be due to all tags for 
>>>> the queue being used up.
>>>>
>>>> Eventually, this makes it into io_uring's io_rw_should_reissue and 
>>>> hits same_thread_group(req->tctx->task, current) = false (in 
>>>> https://github.com/torvalds/linux/blob/13563da6ffcf49b8b45772e40b35f96926a7ee1e/io_uring/rw.c#L437).
>>>> As a result, CQE.res = -EAGAIN is thrown back to the user space 
>>>> program.
>>>>
>>>> Here's a stack dump when we hit same_thread_group(req->tctx->task,
>>>> current) = false
>>>>
>>>> kernel: [237700.098733]  dump_stack_lvl+0x44/0x5c
>>>> kernel: [237700.098737]  io_rw_should_reissue.cold+0x5d/0x64
>>>> kernel: [237700.098742]  io_complete_rw+0x9a/0xc0
>>>> kernel: [237700.098745]  blkdev_bio_end_io_async+0x33/0x80
>>>> kernel: [237700.098749]  blk_mq_submit_bio+0x5b5/0x620
>>>> kernel: [237700.098756]  submit_bio_noacct_nocheck+0x163/0x370
>>>> kernel: [237700.098760]  ? submit_bio_noacct+0x79/0x4b0
>>>> kernel: [237700.098764]  nvme_requeue_work+0x4b/0x60 [nvme_core]
>>>> kernel: [237700.098776]  process_one_work+0x1c7/0x380
>>>> kernel: [237700.098782]  worker_thread+0x4d/0x380
>>>> kernel: [237700.098786]  ? _raw_spin_lock_irqsave+0x23/0x50
>>>> kernel: [237700.098791]  ? rescuer_thread+0x3a0/0x3a0
>>>> kernel: [237700.098794]  kthread+0xe9/0x110
>>>> kernel: [237700.098798]  ? kthread_complete_and_exit+0x20/0x20
>>>> kernel: [237700.098802]  ret_from_fork+0x22/0x30
>>>> kernel: [237700.098811]  </TASK>
>>>>
>>>> Is the same_thread_group() check really needed in this case? The 
>>>> thread groups are certainly different. Any side effects if this 
>>>> check is removed?
>>>
>>> It's there for safety reasons - across all request types, it's not 
>>> always safe. For this case, absolutely the check does not need to be 
>>> there. So probably best to ponder ways to bypass it selectively.
>>>
>>> Let me ponder a bit what the best approach would be here...
>>
>> Actually I think we can just remove it. The actual retry will happen 
>> out of context anyway, and the comment about the import is no longer 
>> valid as the import will have been done upfront since 6.10.
>>
>> Do you want to send a patch for that, or do you want me to send one 
>> out referencing this report?
> 
> Also see:
> 
> commit 039a2e800bcd5beb89909d1a488abf3d647642cf
> Author: Jens Axboe <[email protected]>
> Date:   Thu Apr 25 09:04:32 2024 -0600
> 
>     io_uring/rw: reinstate thread check for retries
> 
> let me take a closer look tomorrow...

If you can test a custom kernel, can you give this branch a try?

git://git.kernel.dk/linux.git io_uring-rw-retry

--
Jens Axboe


Thread overview: 9+ messages
2025-01-06 20:03 Bug? CQE.res = -EAGAIN with nvme multipath driver Haeuptle, Michael
2025-01-06 23:53 ` Jens Axboe
2025-01-07  2:33   ` Jens Axboe
2025-01-07  2:39     ` Jens Axboe
2025-01-07 18:12       ` Jens Axboe
2025-01-07 18:24         ` Haeuptle, Michael [this message]
2025-01-07 18:26           ` Jens Axboe
2025-01-09 18:09             ` Haeuptle, Michael
2025-01-10 14:53               ` Jens Axboe
