public inbox for io-uring@vger.kernel.org
* [PATCH] nvme: optimize passthrough IOPOLL completion for local ring context
@ 2026-01-15  8:59 Ming Lei
  2026-01-15 18:21 ` Keith Busch
  0 siblings, 1 reply; 3+ messages in thread
From: Ming Lei @ 2026-01-15  8:59 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: linux-block, linux-nvme, Ming Lei

When multiple io_uring rings poll on the same NVMe queue, one ring can
find completions belonging to another ring. The current code always
uses task_work to handle this, but this adds overhead for the common
single-ring case.

This patch passes the polling io_ring_ctx through the iopoll callback
chain via io_comp_batch and stores it in the request. In the NVMe
end_io handler, we compare the polling context with the request's
owning context. If they match (local), we complete inline. If they
differ (remote) or it's a non-IOPOLL path, we use task_work as before.

Changes:
- Add poll_ctx field to struct io_comp_batch
- Add poll_ctx to struct request's hash/ipi_list union
- Set iob.poll_ctx in io_do_iopoll() before calling iopoll callbacks
- Store poll_ctx in request in nvme_ns_chr_uring_cmd_iopoll()
- Check local vs remote context in nvme_uring_cmd_end_io()

A ~10% IOPS improvement is observed with the following benchmark:

fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B[0|1] -O0 -P1 -u1 -n1 /dev/ng0n1

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/ioctl.c | 36 ++++++++++++++++++++++++++++--------
 include/linux/blk-mq.h    |  4 +++-
 include/linux/blkdev.h    |  1 +
 io_uring/rw.c             |  7 +++++++
 4 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index a9c097dacad6..0b85378f7fbb 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -425,14 +425,28 @@ static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req,
 	pdu->result = le64_to_cpu(nvme_req(req)->result.u64);
 
 	/*
-	 * IOPOLL could potentially complete this request directly, but
-	 * if multiple rings are polling on the same queue, then it's possible
-	 * for one ring to find completions for another ring. Punting the
-	 * completion via task_work will always direct it to the right
-	 * location, rather than potentially complete requests for ringA
-	 * under iopoll invocations from ringB.
+	 * For IOPOLL, check if this completion is happening in the context
+	 * of the same io_ring that owns the request (local context). If so,
+	 * we can complete inline without task_work overhead. Otherwise, we
+	 * must punt to task_work to ensure completion happens in the correct
+	 * ring's context.
 	 */
-	io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
+	if (blk_rq_is_poll(req) && req->poll_ctx == io_uring_cmd_ctx_handle(ioucmd)) {
+		/*
+		 * Local context: the polling ring owns this request.
+		 * Complete inline for optimal performance.
+		 */
+		if (pdu->bio)
+			blk_rq_unmap_user(pdu->bio);
+		io_uring_cmd_done32(ioucmd, pdu->status, pdu->result, 0);
+	} else {
+		/*
+		 * Remote or non-IOPOLL context: either a different ring found
+		 * this completion, or this is IRQ/softirq completion. Use
+		 * task_work to direct completion to the correct location.
+		 */
+		io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
+	}
 	return RQ_END_IO_FREE;
 }
 
@@ -677,8 +691,14 @@ int nvme_ns_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd,
 	struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
 	struct request *req = pdu->req;
 
-	if (req && blk_rq_is_poll(req))
+	if (req && blk_rq_is_poll(req)) {
+		/*
+		 * Store the polling context in the request so end_io can
+		 * detect if it's completing in the local ring's context.
+		 */
+		req->poll_ctx = iob ? iob->poll_ctx : NULL;
 		return blk_rq_poll(req, iob, poll_flags);
+	}
 	return 0;
 }
 #ifdef CONFIG_NVME_MULTIPATH
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index cae9e857aea4..1975f5dd29f8 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -175,11 +175,13 @@ struct request {
 	 * request reaches the dispatch list. The ipi_list is only used
 	 * to queue the request for softirq completion, which is long
 	 * after the request has been unhashed (and even removed from
-	 * the dispatch list).
+	 * the dispatch list). poll_ctx is used during iopoll to track
+	 * the io_ring_ctx that initiated the poll operation.
 	 */
 	union {
 		struct hlist_node hash;	/* merge hash */
 		struct llist_node ipi_list;
+		void *poll_ctx;		/* iopoll context */
 	};
 
 	/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 72e34acd439c..4ed708912127 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1820,6 +1820,7 @@ void bdev_fput(struct file *bdev_file);
 
 struct io_comp_batch {
 	struct rq_list req_list;
+	void *poll_ctx;
 	bool need_ts;
 	void (*complete)(struct io_comp_batch *);
 };
diff --git a/io_uring/rw.c b/io_uring/rw.c
index c33c533a267e..27a49ce3de46 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1321,6 +1321,13 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 	struct io_kiocb *req, *tmp;
 	int nr_events = 0;
 
+	/*
+	 * Store the polling ctx so drivers can detect if they're completing
+	 * a request from the same ring that's polling (local) vs a different
+	 * ring (remote). This enables optimizations for local completions.
+	 */
+	iob.poll_ctx = ctx;
+
 	/*
 	 * Only spin for completions if we don't have multiple devices hanging
 	 * off our complete list.
-- 
2.47.1



* Re: [PATCH] nvme: optimize passthrough IOPOLL completion for local ring context
  2026-01-15  8:59 [PATCH] nvme: optimize passthrough IOPOLL completion for local ring context Ming Lei
@ 2026-01-15 18:21 ` Keith Busch
  2026-01-16  2:29   ` Ming Lei
  0 siblings, 1 reply; 3+ messages in thread
From: Keith Busch @ 2026-01-15 18:21 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, io-uring, linux-block, linux-nvme

On Thu, Jan 15, 2026 at 04:59:52PM +0800, Ming Lei wrote:
> +	if (blk_rq_is_poll(req) && req->poll_ctx == io_uring_cmd_ctx_handle(ioucmd)) {

...

> @@ -677,8 +691,14 @@ int nvme_ns_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd,
>  	struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
>  	struct request *req = pdu->req;
>  
> -	if (req && blk_rq_is_poll(req))
> +	if (req && blk_rq_is_poll(req)) {
> +		/*
> +		 * Store the polling context in the request so end_io can
> +		 * detect if it's completing in the local ring's context.
> +		 */
> +		req->poll_ctx = iob ? iob->poll_ctx : NULL;

I don't think this works. The io_uring polling always polls from a
single ctx's iopoll_list, so it's redundant to store the ctx in the iob
since it will always match the ctx of the ioucmd passed in.

Which then leads to the check at the top: if req->poll_ctx was ever set,
then it should always match its ioucmd ctx too, right? If it was set
once before but that poll didn't find the completion, and a different
ctx's polling then finds it, we won't complete it in the io_uring task
as needed.

I think you want to save off the ctx that called
'nvme_ns_chr_uring_cmd_iopoll()', but there doesn't seem to be an
immediate way to refer back to that from 'nvme_uring_cmd_end_io'. Maybe
stash it in current->io_uring->last instead, then check if
io_uring_cmd_ctx_handle(ioucmd) equals that.
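
Untested sketch of what I mean (it assumes ->last is still safe to
repurpose for this on the polling path, and that the ctx handle can be
compared against it directly):

	/* io_do_iopoll(): remember which ring is doing the reaping */
	current->io_uring->last = ctx;

	/*
	 * nvme_uring_cmd_end_io(): complete inline only when the task
	 * reaping the CQ is polling on behalf of this command's own ring
	 */
	if (blk_rq_is_poll(req) && current->io_uring &&
	    io_uring_cmd_ctx_handle(ioucmd) == (void *)current->io_uring->last) {
		if (pdu->bio)
			blk_rq_unmap_user(pdu->bio);
		io_uring_cmd_done32(ioucmd, pdu->status, pdu->result, 0);
	} else {
		io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
	}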


* Re: [PATCH] nvme: optimize passthrough IOPOLL completion for local ring context
  2026-01-15 18:21 ` Keith Busch
@ 2026-01-16  2:29   ` Ming Lei
  0 siblings, 0 replies; 3+ messages in thread
From: Ming Lei @ 2026-01-16  2:29 UTC (permalink / raw)
  To: Keith Busch; +Cc: Jens Axboe, io-uring, linux-block, linux-nvme

On Thu, Jan 15, 2026 at 11:21:41AM -0700, Keith Busch wrote:
> On Thu, Jan 15, 2026 at 04:59:52PM +0800, Ming Lei wrote:
> > +	if (blk_rq_is_poll(req) && req->poll_ctx == io_uring_cmd_ctx_handle(ioucmd)) {
> 
> ...
> 
> > @@ -677,8 +691,14 @@ int nvme_ns_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd,
> >  	struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
> >  	struct request *req = pdu->req;
> >  
> > -	if (req && blk_rq_is_poll(req))
> > +	if (req && blk_rq_is_poll(req)) {
> > +		/*
> > +		 * Store the polling context in the request so end_io can
> > +		 * detect if it's completing in the local ring's context.
> > +		 */
> > +		req->poll_ctx = iob ? iob->poll_ctx : NULL;
> 
> I don't think this works. The io_uring polling always polls from a
> single ctx's iopoll_list, so it's redundant to store the ctx in the iob
> since it will always match the ctx of the ioucmd passed in.

Yeah, the patch looks totally wrong: what it should record is the
io_ring_ctx that is actually completing the request, not the context
that owns the uring_cmd, which can always be obtained from the
uring_cmd itself.

> 
> Which then leads to the check at the top: if req->poll_ctx was ever set,
> then it should always match its ioucmd ctx too, right? If it was set
> once before but that poll didn't find the completion, and a different
> ctx's polling then finds it, we won't complete it in the io_uring task
> as needed.
> 
> I think you want to save off the ctx that called
> 'nvme_ns_chr_uring_cmd_iopoll()', but there doesn't seem to be an
> immediate way to refer back to that from 'nvme_uring_cmd_end_io'. Maybe
> stash it in current->io_uring->last instead, then check if
> io_uring_cmd_ctx_handle(ioucmd) equals that.

One easy way is to add `iob` to rq_end_io_fn, then pass it via
blk_mq_end_request_batch().
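
A rough, untested sketch of that direction; the plumbing through
blk_mq_end_request_batch()/blk_mq_end_request() (passing NULL on the
non-poll paths) and the updates to every existing end_io callback are
left out here:

	/* let the end_io handler see the batch that is reaping it */
	typedef enum rq_end_io_ret (rq_end_io_fn)(struct request *,
						  blk_status_t,
						  struct io_comp_batch *);

then nvme_uring_cmd_end_io() can compare against the polling ctx
directly:

	if (blk_rq_is_poll(req) && iob &&
	    iob->poll_ctx == io_uring_cmd_ctx_handle(ioucmd)) {
		/* local: the polling ring owns this command */
		if (pdu->bio)
			blk_rq_unmap_user(pdu->bio);
		io_uring_cmd_done32(ioucmd, pdu->status, pdu->result, 0);
	} else {
		/* remote poller or IRQ completion */
		io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
	}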


Thanks,
Ming


