From: Jens Axboe <[email protected]>
To: "Haeuptle, Michael" <[email protected]>,
"[email protected]" <[email protected]>
Subject: Re: Bug? CQE.res = -EAGAIN with nvme multipath driver
Date: Mon, 6 Jan 2025 16:53:33 -0700 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <DS7PR84MB31105C2C63CFA47BE8CBD6EE95102@DS7PR84MB3110.NAMPRD84.PROD.OUTLOOK.COM>
On 1/6/25 1:03 PM, Haeuptle, Michael wrote:
> Hello,
>
> I?m using the nvme multipath driver (NVMF/RDMA) and io-uring. When a
> path goes away, I sometimes get a CQE.res = -EAGAIN in user space.
> This is unexpected since the nvme multipath driver should handle this
> transparently. It?s somewhat workload related but easy to reproduce
> with fio.
>
> The multipath driver uses kblockd worker to re-queue the failed NVME
> bios
> (https://github.com/torvalds/linux/blob/13563da6ffcf49b8b45772e40b35f96926a7ee1e/drivers/nvme/host/multipath.c#L126).
> The original request is ended.
>
> When the nvme_requeue_work callback is executed, the blk layer tries
> to allocate a new request for the bios but that fails and the bio
> status is set to BLK_STS_AGAIN
> (https://elixir.bootlin.com/linux/v6.12.6/source/block/blk-mq.c#L2987).
> The failure to allocate a new req seems to be due to all tags for the
> queue being used up.
>
> Eventually, this makes it into io_uring?s io_rw_should_reissue and
> hits same_thread_group(req->tctx->task, current) = false (in
> https://github.com/torvalds/linux/blob/13563da6ffcf49b8b45772e40b35f96926a7ee1e/io_uring/rw.c#L437).
> As a result, CQE.res = -EAGAIN and thrown back to the user space
> program.
>
> Here?s a stack dump when we hit same_thread_group(req->tctx->task,
> current) = false
>
> kernel: [237700.098733] dump_stack_lvl+0x44/0x5c
> kernel: [237700.098737] io_rw_should_reissue.cold+0x5d/0x64
> kernel: [237700.098742] io_complete_rw+0x9a/0xc0
> kernel: [237700.098745] blkdev_bio_end_io_async+0x33/0x80
> kernel: [237700.098749] blk_mq_submit_bio+0x5b5/0x620
> kernel: [237700.098756] submit_bio_noacct_nocheck+0x163/0x370
> kernel: [237700.098760] ? submit_bio_noacct+0x79/0x4b0
> kernel: [237700.098764] nvme_requeue_work+0x4b/0x60 [nvme_core]
> kernel: [237700.098776] process_one_work+0x1c7/0x380
> kernel: [237700.098782] worker_thread+0x4d/0x380
> kernel: [237700.098786] ? _raw_spin_lock_irqsave+0x23/0x50
> kernel: [237700.098791] ? rescuer_thread+0x3a0/0x3a0
> kernel: [237700.098794] kthread+0xe9/0x110
> kernel: [237700.098798] ? kthread_complete_and_exit+0x20/0x20
> kernel: [237700.098802] ret_from_fork+0x22/0x30
> kernel: [237700.098811] </TASK>
>
> Is the same_thread_group() check really needed in this case? The
> thread groups are certainly different? Any side effects if this check
> is being removed?
It's their for safety reasons - across all request types, it's not
always safe. For this case, absolutely the check does not need to be
there. So probably best to ponder ways to bypass it selectively.
Let me ponder a bit what the best approach would be here...
--
Jens Axboe
next prev parent reply other threads:[~2025-01-06 23:53 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-06 20:03 Bug? CQE.res = -EAGAIN with nvme multipath driver Haeuptle, Michael
2025-01-06 23:53 ` Jens Axboe [this message]
2025-01-07 2:33 ` Jens Axboe
2025-01-07 2:39 ` Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox