* [PATCH 1/6] nvme-pci: reverse request order in nvme_queue_rqs
2024-11-13 15:20 don't reorder requests passed to ->queue_rqs Christoph Hellwig
@ 2024-11-13 15:20 ` Christoph Hellwig
2024-11-13 19:10 ` Keith Busch
2024-11-13 15:20 ` [PATCH 2/6] virtio_blk: reverse request order in virtio_queue_rqs Christoph Hellwig
` (7 subsequent siblings)
8 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2024-11-13 15:20 UTC (permalink / raw)
To: Jens Axboe
Cc: Michael S. Tsirkin, Jason Wang, Keith Busch, Sagi Grimberg,
Pavel Begunkov, linux-block, virtualization, linux-nvme, io-uring
blk_mq_flush_plug_list submits requests in the reverse order that they
were submitted, which leads to a rather suboptimal I/O pattern especially
in rotational devices. Fix this by rewriting nvme_queue_rqs so that it
always pops the requests from the passed in request list, and then adds
them to the head of a local submit list. This actually simplifies the
code a bit as it removes the complicated list splicing, at the cost of
extra updates of the rq_next pointer. As that should be cache hot
anyway it should be an easy price to pay.
Fixes: d62cbcf62f2f ("nvme: add support for mq_ops->queue_rqs()")
Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/nvme/host/pci.c | 39 +++++++++++++++++----------------------
1 file changed, 17 insertions(+), 22 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4b9fda0b1d9a..c66ea2c23690 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -904,9 +904,10 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
static void nvme_submit_cmds(struct nvme_queue *nvmeq, struct request **rqlist)
{
+ struct request *req;
+
spin_lock(&nvmeq->sq_lock);
- while (!rq_list_empty(*rqlist)) {
- struct request *req = rq_list_pop(rqlist);
+ while ((req = rq_list_pop(rqlist))) {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
nvme_sq_copy_cmd(nvmeq, &iod->cmd);
@@ -931,31 +932,25 @@ static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
static void nvme_queue_rqs(struct request **rqlist)
{
- struct request *req, *next, *prev = NULL;
+ struct request *submit_list = NULL;
struct request *requeue_list = NULL;
+ struct request **requeue_lastp = &requeue_list;
+ struct nvme_queue *nvmeq = NULL;
+ struct request *req;
- rq_list_for_each_safe(rqlist, req, next) {
- struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
-
- if (!nvme_prep_rq_batch(nvmeq, req)) {
- /* detach 'req' and add to remainder list */
- rq_list_move(rqlist, &requeue_list, req, prev);
-
- req = prev;
- if (!req)
- continue;
- }
+ while ((req = rq_list_pop(rqlist))) {
+ if (nvmeq && nvmeq != req->mq_hctx->driver_data)
+ nvme_submit_cmds(nvmeq, &submit_list);
+ nvmeq = req->mq_hctx->driver_data;
- if (!next || req->mq_hctx != next->mq_hctx) {
- /* detach rest of list, and submit */
- req->rq_next = NULL;
- nvme_submit_cmds(nvmeq, rqlist);
- *rqlist = next;
- prev = NULL;
- } else
- prev = req;
+ if (nvme_prep_rq_batch(nvmeq, req))
+ rq_list_add(&submit_list, req); /* reverse order */
+ else
+ rq_list_add_tail(&requeue_lastp, req);
}
+ if (nvmeq)
+ nvme_submit_cmds(nvmeq, &submit_list);
*rqlist = requeue_list;
}
--
2.45.2
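
A quick aside on the pattern the commit message describes: the plug builds its
list by pushing each new request to the front, so the driver receives it in
reverse submission order; popping every entry and pushing it onto the head of a
local submit list reverses it a second time, restoring submission order. The
sketch below is a minimal, self-contained userspace model of that property;
struct req, push_head() and pop_head() are invented names, not kernel APIs.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative userspace model, not kernel code. */
struct req {
	int id;
	struct req *next;
};

static void push_head(struct req **list, struct req *r)
{
	r->next = *list;
	*list = r;
}

static struct req *pop_head(struct req **list)
{
	struct req *r = *list;

	if (r)
		*list = r->next;
	return r;
}

int main(void)
{
	struct req *plug = NULL, *submit = NULL, *r;
	int i;

	/* The plug pushes each new request to the front of its list. */
	for (i = 1; i <= 4; i++) {
		r = calloc(1, sizeof(*r));
		r->id = i;
		push_head(&plug, r);
	}
	/* plug is now 4, 3, 2, 1: reverse submission order. */

	/* The driver pops each entry and pushes it onto a local list. */
	while ((r = pop_head(&plug)))
		push_head(&submit, r);

	/* Prints 1 2 3 4: submission order is restored. */
	while ((r = pop_head(&submit))) {
		printf("%d ", r->id);
		free(r);
	}
	printf("\n");
	return 0;
}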
* Re: [PATCH 1/6] nvme-pci: reverse request order in nvme_queue_rqs
2024-11-13 15:20 ` [PATCH 1/6] nvme-pci: reverse request order in nvme_queue_rqs Christoph Hellwig
@ 2024-11-13 19:10 ` Keith Busch
0 siblings, 0 replies; 26+ messages in thread
From: Keith Busch @ 2024-11-13 19:10 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Michael S. Tsirkin, Jason Wang, Sagi Grimberg,
Pavel Begunkov, linux-block, virtualization, linux-nvme, io-uring
On Wed, Nov 13, 2024 at 04:20:41PM +0100, Christoph Hellwig wrote:
> blk_mq_flush_plug_list submits requests in the reverse order that they
> were submitted, which leads to a rather suboptimal I/O pattern especially
> in rotational devices. Fix this by rewriting nvme_queue_rqs so that it
> always pops the requests from the passed in request list, and then adds
> them to the head of a local submit list. This actually simplifies the
> code a bit as it removes the complicated list splicing, at the cost of
> extra updates of the rq_next pointer. As that should be cache hot
> anyway it should be an easy price to pay.
Looks good.
Reviewed-by: Keith Busch <[email protected]>
* [PATCH 2/6] virtio_blk: reverse request order in virtio_queue_rqs
2024-11-13 15:20 don't reorder requests passed to ->queue_rqs Christoph Hellwig
2024-11-13 15:20 ` [PATCH 1/6] nvme-pci: reverse request order in nvme_queue_rqs Christoph Hellwig
@ 2024-11-13 15:20 ` Christoph Hellwig
2024-11-13 19:03 ` Keith Busch
2024-11-13 23:25 ` Michael S. Tsirkin
2024-11-13 15:20 ` [PATCH 3/6] block: remove rq_list_move Christoph Hellwig
` (6 subsequent siblings)
8 siblings, 2 replies; 26+ messages in thread
From: Christoph Hellwig @ 2024-11-13 15:20 UTC (permalink / raw)
To: Jens Axboe
Cc: Michael S. Tsirkin, Jason Wang, Keith Busch, Sagi Grimberg,
Pavel Begunkov, linux-block, virtualization, linux-nvme, io-uring
blk_mq_flush_plug_list submits requests in the reverse order that they
were submitted, which leads to a rather suboptimal I/O pattern especially
in rotational devices. Fix this by rewriting nvme_queue_rqs so that it
always pops the requests from the passed in request list, and then adds
them to the head of a local submit list. This actually simplifies the
code a bit as it removes the complicated list splicing, at the cost of
extra updates of the rq_next pointer. As that should be cache hot
anyway it should be an easy price to pay.
Fixes: 0e9911fa768f ("virtio-blk: support mq_ops->queue_rqs()")
Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/block/virtio_blk.c | 46 +++++++++++++++++---------------------
1 file changed, 21 insertions(+), 25 deletions(-)
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 0e99a4714928..b25f7c06a28e 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -471,18 +471,18 @@ static bool virtblk_prep_rq_batch(struct request *req)
return virtblk_prep_rq(req->mq_hctx, vblk, req, vbr) == BLK_STS_OK;
}
-static bool virtblk_add_req_batch(struct virtio_blk_vq *vq,
+static void virtblk_add_req_batch(struct virtio_blk_vq *vq,
struct request **rqlist)
{
+ struct request *req;
unsigned long flags;
- int err;
bool kick;
spin_lock_irqsave(&vq->lock, flags);
- while (!rq_list_empty(*rqlist)) {
- struct request *req = rq_list_pop(rqlist);
+ while ((req = rq_list_pop(rqlist))) {
struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
+ int err;
err = virtblk_add_req(vq->vq, vbr);
if (err) {
@@ -495,37 +495,33 @@ static bool virtblk_add_req_batch(struct virtio_blk_vq *vq,
kick = virtqueue_kick_prepare(vq->vq);
spin_unlock_irqrestore(&vq->lock, flags);
- return kick;
+ if (kick)
+ virtqueue_notify(vq->vq);
}
static void virtio_queue_rqs(struct request **rqlist)
{
- struct request *req, *next, *prev = NULL;
+ struct request *submit_list = NULL;
struct request *requeue_list = NULL;
+ struct request **requeue_lastp = &requeue_list;
+ struct virtio_blk_vq *vq = NULL;
+ struct request *req;
- rq_list_for_each_safe(rqlist, req, next) {
- struct virtio_blk_vq *vq = get_virtio_blk_vq(req->mq_hctx);
- bool kick;
-
- if (!virtblk_prep_rq_batch(req)) {
- rq_list_move(rqlist, &requeue_list, req, prev);
- req = prev;
- if (!req)
- continue;
- }
+ while ((req = rq_list_pop(rqlist))) {
+ struct virtio_blk_vq *this_vq = get_virtio_blk_vq(req->mq_hctx);
- if (!next || req->mq_hctx != next->mq_hctx) {
- req->rq_next = NULL;
- kick = virtblk_add_req_batch(vq, rqlist);
- if (kick)
- virtqueue_notify(vq->vq);
+ if (vq && vq != this_vq)
+ virtblk_add_req_batch(vq, &submit_list);
+ vq = this_vq;
- *rqlist = next;
- prev = NULL;
- } else
- prev = req;
+ if (virtblk_prep_rq_batch(req))
+ rq_list_add(&submit_list, req); /* reverse order */
+ else
+ rq_list_add_tail(&requeue_lastp, req);
}
+ if (vq)
+ virtblk_add_req_batch(vq, &submit_list);
*rqlist = requeue_list;
}
--
2.45.2
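
Both rewritten ->queue_rqs implementations now share the same per-queue
batching shape: accumulate consecutive requests for one queue, flush the batch
whenever the queue changes, and flush once more after the loop. The sketch
below is a self-contained userspace model of that shape only; flush_batch()
and queue_rqs() here are illustrative stand-ins, not the driver functions.

#include <stdio.h>

/* Illustrative userspace model, not kernel code. */
struct rq {
	int queue;
	int id;
};

/* Stand-in for the per-queue submit step (the drivers above take a lock,
 * copy the commands and ring the doorbell / kick the virtqueue here). */
static void flush_batch(int queue, const struct rq *batch, int n)
{
	int i;

	printf("queue %d:", queue);
	for (i = 0; i < n; i++)
		printf(" rq%d", batch[i].id);
	printf("\n");
}

static void queue_rqs(const struct rq *rqs, int nr)
{
	/* Bounded array for the sketch; the drivers above chain requests. */
	struct rq batch[16];
	int cur_queue = -1, n = 0, i;

	for (i = 0; i < nr; i++) {
		/* Queue changed: flush what has been accumulated so far. */
		if (cur_queue >= 0 && rqs[i].queue != cur_queue) {
			flush_batch(cur_queue, batch, n);
			n = 0;
		}
		cur_queue = rqs[i].queue;
		batch[n++] = rqs[i];
	}
	/* Flush the final batch. */
	if (n)
		flush_batch(cur_queue, batch, n);
}

int main(void)
{
	const struct rq rqs[] = {
		{ 0, 1 }, { 0, 2 }, { 1, 3 }, { 1, 4 }, { 0, 5 },
	};

	/* Prints: "queue 0: rq1 rq2", "queue 1: rq3 rq4", "queue 0: rq5". */
	queue_rqs(rqs, 5);
	return 0;
}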
* Re: [PATCH 2/6] virtio_blk: reverse request order in virtio_queue_rqs
2024-11-13 15:20 ` [PATCH 2/6] virtio_blk: reverse request order in virtio_queue_rqs Christoph Hellwig
@ 2024-11-13 19:03 ` Keith Busch
2024-11-13 19:05 ` Jens Axboe
2024-11-13 23:25 ` Michael S. Tsirkin
1 sibling, 1 reply; 26+ messages in thread
From: Keith Busch @ 2024-11-13 19:03 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Michael S. Tsirkin, Jason Wang, Sagi Grimberg,
Pavel Begunkov, linux-block, virtualization, linux-nvme, io-uring
On Wed, Nov 13, 2024 at 04:20:42PM +0100, Christoph Hellwig wrote:
> in rotational devices. Fix this by rewriting nvme_queue_rqs so that it
You copied this commit message from the previous one for the nvme
driver. This message should say "virtio_queue_rqs".
* Re: [PATCH 2/6] virtio_blk: reverse request order in virtio_queue_rqs
2024-11-13 19:03 ` Keith Busch
@ 2024-11-13 19:05 ` Jens Axboe
0 siblings, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2024-11-13 19:05 UTC (permalink / raw)
To: Keith Busch, Christoph Hellwig
Cc: Michael S. Tsirkin, Jason Wang, Sagi Grimberg, Pavel Begunkov,
linux-block, virtualization, linux-nvme, io-uring
On 11/13/24 12:03 PM, Keith Busch wrote:
> On Wed, Nov 13, 2024 at 04:20:42PM +0100, Christoph Hellwig wrote:
>> in rotational devices. Fix this by rewriting nvme_queue_rqs so that it
>
> You copied this commit message from the previous one for the nvme
> driver. This message should say "virtio_queue_rqs".
I fixed it up. As well as the annoying "two spaces after a period", that
should just go away.
--
Jens Axboe
* Re: [PATCH 2/6] virtio_blk: reverse request order in virtio_queue_rqs
2024-11-13 15:20 ` [PATCH 2/6] virtio_blk: reverse request order in virtio_queue_rqs Christoph Hellwig
2024-11-13 19:03 ` Keith Busch
@ 2024-11-13 23:25 ` Michael S. Tsirkin
1 sibling, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2024-11-13 23:25 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Jason Wang, Keith Busch, Sagi Grimberg,
Pavel Begunkov, linux-block, virtualization, linux-nvme, io-uring
On Wed, Nov 13, 2024 at 04:20:42PM +0100, Christoph Hellwig wrote:
> blk_mq_flush_plug_list submits requests in the reverse order that they
> were submitted, which leads to a rather suboptimal I/O pattern especially
> in rotational devices. Fix this by rewriting nvme_queue_rqs so that it
> always pops the requests from the passed in request list, and then adds
> them to the head of a local submit list. This actually simplifies the
> code a bit as it removes the complicated list splicing, at the cost of
> extra updates of the rq_next pointer. As that should be cache hot
> anyway it should be an easy price to pay.
>
> Fixes: 0e9911fa768f ("virtio-blk: support mq_ops->queue_rqs()")
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> drivers/block/virtio_blk.c | 46 +++++++++++++++++---------------------
> 1 file changed, 21 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 0e99a4714928..b25f7c06a28e 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -471,18 +471,18 @@ static bool virtblk_prep_rq_batch(struct request *req)
> return virtblk_prep_rq(req->mq_hctx, vblk, req, vbr) == BLK_STS_OK;
> }
>
> -static bool virtblk_add_req_batch(struct virtio_blk_vq *vq,
> +static void virtblk_add_req_batch(struct virtio_blk_vq *vq,
> struct request **rqlist)
> {
> + struct request *req;
> unsigned long flags;
> - int err;
> bool kick;
>
> spin_lock_irqsave(&vq->lock, flags);
>
> - while (!rq_list_empty(*rqlist)) {
> - struct request *req = rq_list_pop(rqlist);
> + while ((req = rq_list_pop(rqlist))) {
> struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
> + int err;
>
> err = virtblk_add_req(vq->vq, vbr);
> if (err) {
> @@ -495,37 +495,33 @@ static bool virtblk_add_req_batch(struct virtio_blk_vq *vq,
> kick = virtqueue_kick_prepare(vq->vq);
> spin_unlock_irqrestore(&vq->lock, flags);
>
> - return kick;
> + if (kick)
> + virtqueue_notify(vq->vq);
> }
>
> static void virtio_queue_rqs(struct request **rqlist)
> {
> - struct request *req, *next, *prev = NULL;
> + struct request *submit_list = NULL;
> struct request *requeue_list = NULL;
> + struct request **requeue_lastp = &requeue_list;
> + struct virtio_blk_vq *vq = NULL;
> + struct request *req;
>
> - rq_list_for_each_safe(rqlist, req, next) {
> - struct virtio_blk_vq *vq = get_virtio_blk_vq(req->mq_hctx);
> - bool kick;
> -
> - if (!virtblk_prep_rq_batch(req)) {
> - rq_list_move(rqlist, &requeue_list, req, prev);
> - req = prev;
> - if (!req)
> - continue;
> - }
> + while ((req = rq_list_pop(rqlist))) {
> + struct virtio_blk_vq *this_vq = get_virtio_blk_vq(req->mq_hctx);
>
> - if (!next || req->mq_hctx != next->mq_hctx) {
> - req->rq_next = NULL;
> - kick = virtblk_add_req_batch(vq, rqlist);
> - if (kick)
> - virtqueue_notify(vq->vq);
> + if (vq && vq != this_vq)
> + virtblk_add_req_batch(vq, &submit_list);
> + vq = this_vq;
>
> - *rqlist = next;
> - prev = NULL;
> - } else
> - prev = req;
> + if (virtblk_prep_rq_batch(req))
> + rq_list_add(&submit_list, req); /* reverse order */
> + else
> + rq_list_add_tail(&requeue_lastp, req);
> }
>
> + if (vq)
> + virtblk_add_req_batch(vq, &submit_list);
> *rqlist = requeue_list;
> }
looks ok from virtio POV
Acked-by: Michael S. Tsirkin <[email protected]>
> --
> 2.45.2
* [PATCH 3/6] block: remove rq_list_move
2024-11-13 15:20 don't reorder requests passed to ->queue_rqs Christoph Hellwig
2024-11-13 15:20 ` [PATCH 1/6] nvme-pci: reverse request order in nvme_queue_rqs Christoph Hellwig
2024-11-13 15:20 ` [PATCH 2/6] virtio_blk: reverse request order in virtio_queue_rqs Christoph Hellwig
@ 2024-11-13 15:20 ` Christoph Hellwig
2024-11-13 15:20 ` [PATCH 4/6] block: add a rq_list type Christoph Hellwig
` (5 subsequent siblings)
8 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2024-11-13 15:20 UTC (permalink / raw)
To: Jens Axboe
Cc: Michael S. Tsirkin, Jason Wang, Keith Busch, Sagi Grimberg,
Pavel Begunkov, linux-block, virtualization, linux-nvme, io-uring
Unused now.
Signed-off-by: Christoph Hellwig <[email protected]>
---
include/linux/blk-mq.h | 17 -----------------
1 file changed, 17 deletions(-)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index a28264442948..ad26a41d13f9 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -268,23 +268,6 @@ static inline unsigned short req_get_ioprio(struct request *req)
#define rq_list_next(rq) (rq)->rq_next
#define rq_list_empty(list) ((list) == (struct request *) NULL)
-/**
- * rq_list_move() - move a struct request from one list to another
- * @src: The source list @rq is currently in
- * @dst: The destination list that @rq will be appended to
- * @rq: The request to move
- * @prev: The request preceding @rq in @src (NULL if @rq is the head)
- */
-static inline void rq_list_move(struct request **src, struct request **dst,
- struct request *rq, struct request *prev)
-{
- if (prev)
- prev->rq_next = rq->rq_next;
- else
- *src = rq->rq_next;
- rq_list_add(dst, rq);
-}
-
/**
* enum blk_eh_timer_return - How the timeout handler should proceed
* @BLK_EH_DONE: The block driver completed the command or will complete it at
--
2.45.2
* [PATCH 4/6] block: add a rq_list type
2024-11-13 15:20 don't reorder requests passed to ->queue_rqs Christoph Hellwig
` (2 preceding siblings ...)
2024-11-13 15:20 ` [PATCH 3/6] block: remove rq_list_move Christoph Hellwig
@ 2024-11-13 15:20 ` Christoph Hellwig
2024-11-14 20:11 ` Nathan Chancellor
2024-11-13 15:20 ` [PATCH 5/6] block: don't reorder requests in blk_add_rq_to_plug Christoph Hellwig
` (4 subsequent siblings)
8 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2024-11-13 15:20 UTC (permalink / raw)
To: Jens Axboe
Cc: Michael S. Tsirkin, Jason Wang, Keith Busch, Sagi Grimberg,
Pavel Begunkov, linux-block, virtualization, linux-nvme, io-uring
Replace the semi-open coded request list helpers with a proper rq_list
type that mirrors the bio_list and has head and tail pointers. Besides
better type safety, this actually allows inserting at the tail of the
list, which will be useful soon.
Signed-off-by: Christoph Hellwig <[email protected]>
---
block/blk-core.c | 6 +--
block/blk-merge.c | 2 +-
block/blk-mq.c | 40 ++++++++--------
block/blk-mq.h | 2 +-
drivers/block/null_blk/main.c | 9 ++--
drivers/block/virtio_blk.c | 13 +++---
drivers/nvme/host/apple.c | 2 +-
drivers/nvme/host/pci.c | 15 +++---
include/linux/blk-mq.h | 88 +++++++++++++++++++++--------------
include/linux/blkdev.h | 11 +++--
io_uring/rw.c | 4 +-
11 files changed, 104 insertions(+), 88 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 0387172e8259..666efe8fa202 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1120,8 +1120,8 @@ void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios)
return;
plug->cur_ktime = 0;
- plug->mq_list = NULL;
- plug->cached_rq = NULL;
+ rq_list_init(&plug->mq_list);
+ rq_list_init(&plug->cached_rqs);
plug->nr_ios = min_t(unsigned short, nr_ios, BLK_MAX_REQUEST_COUNT);
plug->rq_count = 0;
plug->multiple_queues = false;
@@ -1217,7 +1217,7 @@ void __blk_flush_plug(struct blk_plug *plug, bool from_schedule)
* queue for cached requests, we don't want a blocked task holding
* up a queue freeze/quiesce event.
*/
- if (unlikely(!rq_list_empty(plug->cached_rq)))
+ if (unlikely(!rq_list_empty(&plug->cached_rqs)))
blk_mq_free_plug_rqs(plug);
plug->cur_ktime = 0;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index df36f83f3738..e0b28e9298c9 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -1179,7 +1179,7 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
struct blk_plug *plug = current->plug;
struct request *rq;
- if (!plug || rq_list_empty(plug->mq_list))
+ if (!plug || rq_list_empty(&plug->mq_list))
return false;
rq_list_for_each(&plug->mq_list, rq) {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3c6cadba75e3..ff0b819e35fc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -478,7 +478,7 @@ __blk_mq_alloc_requests_batch(struct blk_mq_alloc_data *data)
prefetch(tags->static_rqs[tag]);
tag_mask &= ~(1UL << i);
rq = blk_mq_rq_ctx_init(data, tags, tag);
- rq_list_add(data->cached_rq, rq);
+ rq_list_add_head(data->cached_rqs, rq);
nr++;
}
if (!(data->rq_flags & RQF_SCHED_TAGS))
@@ -487,7 +487,7 @@ __blk_mq_alloc_requests_batch(struct blk_mq_alloc_data *data)
percpu_ref_get_many(&data->q->q_usage_counter, nr - 1);
data->nr_tags -= nr;
- return rq_list_pop(data->cached_rq);
+ return rq_list_pop(data->cached_rqs);
}
static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
@@ -584,7 +584,7 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
.flags = flags,
.cmd_flags = opf,
.nr_tags = plug->nr_ios,
- .cached_rq = &plug->cached_rq,
+ .cached_rqs = &plug->cached_rqs,
};
struct request *rq;
@@ -609,14 +609,14 @@ static struct request *blk_mq_alloc_cached_request(struct request_queue *q,
if (!plug)
return NULL;
- if (rq_list_empty(plug->cached_rq)) {
+ if (rq_list_empty(&plug->cached_rqs)) {
if (plug->nr_ios == 1)
return NULL;
rq = blk_mq_rq_cache_fill(q, plug, opf, flags);
if (!rq)
return NULL;
} else {
- rq = rq_list_peek(&plug->cached_rq);
+ rq = rq_list_peek(&plug->cached_rqs);
if (!rq || rq->q != q)
return NULL;
@@ -625,7 +625,7 @@ static struct request *blk_mq_alloc_cached_request(struct request_queue *q,
if (op_is_flush(rq->cmd_flags) != op_is_flush(opf))
return NULL;
- plug->cached_rq = rq_list_next(rq);
+ rq_list_pop(&plug->cached_rqs);
blk_mq_rq_time_init(rq, blk_time_get_ns());
}
@@ -802,7 +802,7 @@ void blk_mq_free_plug_rqs(struct blk_plug *plug)
{
struct request *rq;
- while ((rq = rq_list_pop(&plug->cached_rq)) != NULL)
+ while ((rq = rq_list_pop(&plug->cached_rqs)) != NULL)
blk_mq_free_request(rq);
}
@@ -1392,8 +1392,7 @@ static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
*/
if (!plug->has_elevator && (rq->rq_flags & RQF_SCHED_TAGS))
plug->has_elevator = true;
- rq->rq_next = NULL;
- rq_list_add(&plug->mq_list, rq);
+ rq_list_add_head(&plug->mq_list, rq);
plug->rq_count++;
}
@@ -2785,7 +2784,7 @@ static void blk_mq_plug_issue_direct(struct blk_plug *plug)
blk_status_t ret = BLK_STS_OK;
while ((rq = rq_list_pop(&plug->mq_list))) {
- bool last = rq_list_empty(plug->mq_list);
+ bool last = rq_list_empty(&plug->mq_list);
if (hctx != rq->mq_hctx) {
if (hctx) {
@@ -2828,8 +2827,7 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
{
struct blk_mq_hw_ctx *this_hctx = NULL;
struct blk_mq_ctx *this_ctx = NULL;
- struct request *requeue_list = NULL;
- struct request **requeue_lastp = &requeue_list;
+ struct rq_list requeue_list = {};
unsigned int depth = 0;
bool is_passthrough = false;
LIST_HEAD(list);
@@ -2843,12 +2841,12 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
is_passthrough = blk_rq_is_passthrough(rq);
} else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx ||
is_passthrough != blk_rq_is_passthrough(rq)) {
- rq_list_add_tail(&requeue_lastp, rq);
+ rq_list_add_tail(&requeue_list, rq);
continue;
}
list_add(&rq->queuelist, &list);
depth++;
- } while (!rq_list_empty(plug->mq_list));
+ } while (!rq_list_empty(&plug->mq_list));
plug->mq_list = requeue_list;
trace_block_unplug(this_hctx->queue, depth, !from_sched);
@@ -2903,19 +2901,19 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
if (q->mq_ops->queue_rqs) {
blk_mq_run_dispatch_ops(q,
__blk_mq_flush_plug_list(q, plug));
- if (rq_list_empty(plug->mq_list))
+ if (rq_list_empty(&plug->mq_list))
return;
}
blk_mq_run_dispatch_ops(q,
blk_mq_plug_issue_direct(plug));
- if (rq_list_empty(plug->mq_list))
+ if (rq_list_empty(&plug->mq_list))
return;
}
do {
blk_mq_dispatch_plug_list(plug, from_schedule);
- } while (!rq_list_empty(plug->mq_list));
+ } while (!rq_list_empty(&plug->mq_list));
}
static void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
@@ -2980,7 +2978,7 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
if (plug) {
data.nr_tags = plug->nr_ios;
plug->nr_ios = 1;
- data.cached_rq = &plug->cached_rq;
+ data.cached_rqs = &plug->cached_rqs;
}
rq = __blk_mq_alloc_requests(&data);
@@ -3003,7 +3001,7 @@ static struct request *blk_mq_peek_cached_request(struct blk_plug *plug,
if (!plug)
return NULL;
- rq = rq_list_peek(&plug->cached_rq);
+ rq = rq_list_peek(&plug->cached_rqs);
if (!rq || rq->q != q)
return NULL;
if (type != rq->mq_hctx->type &&
@@ -3017,14 +3015,14 @@ static struct request *blk_mq_peek_cached_request(struct blk_plug *plug,
static void blk_mq_use_cached_rq(struct request *rq, struct blk_plug *plug,
struct bio *bio)
{
- WARN_ON_ONCE(rq_list_peek(&plug->cached_rq) != rq);
+ if (rq_list_pop(&plug->cached_rqs) != rq)
+ WARN_ON_ONCE(1);
/*
* If any qos ->throttle() end up blocking, we will have flushed the
* plug and hence killed the cached_rq list as well. Pop this entry
* before we throttle.
*/
- plug->cached_rq = rq_list_next(rq);
rq_qos_throttle(rq->q, bio);
blk_mq_rq_time_init(rq, blk_time_get_ns());
diff --git a/block/blk-mq.h b/block/blk-mq.h
index f4ac1af77a26..89a20fffa4b1 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -155,7 +155,7 @@ struct blk_mq_alloc_data {
/* allocate multiple requests/tags in one go */
unsigned int nr_tags;
- struct request **cached_rq;
+ struct rq_list *cached_rqs;
/* input & output parameter */
struct blk_mq_ctx *ctx;
diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 2f0431e42c49..3c3d8d200abb 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -1638,10 +1638,9 @@ static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
return BLK_STS_OK;
}
-static void null_queue_rqs(struct request **rqlist)
+static void null_queue_rqs(struct rq_list *rqlist)
{
- struct request *requeue_list = NULL;
- struct request **requeue_lastp = &requeue_list;
+ struct rq_list requeue_list = {};
struct blk_mq_queue_data bd = { };
blk_status_t ret;
@@ -1651,8 +1650,8 @@ static void null_queue_rqs(struct request **rqlist)
bd.rq = rq;
ret = null_queue_rq(rq->mq_hctx, &bd);
if (ret != BLK_STS_OK)
- rq_list_add_tail(&requeue_lastp, rq);
- } while (!rq_list_empty(*rqlist));
+ rq_list_add_tail(&requeue_list, rq);
+ } while (!rq_list_empty(rqlist));
*rqlist = requeue_list;
}
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index b25f7c06a28e..a19f24c19140 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -472,7 +472,7 @@ static bool virtblk_prep_rq_batch(struct request *req)
}
static void virtblk_add_req_batch(struct virtio_blk_vq *vq,
- struct request **rqlist)
+ struct rq_list *rqlist)
{
struct request *req;
unsigned long flags;
@@ -499,11 +499,10 @@ static void virtblk_add_req_batch(struct virtio_blk_vq *vq,
virtqueue_notify(vq->vq);
}
-static void virtio_queue_rqs(struct request **rqlist)
+static void virtio_queue_rqs(struct rq_list *rqlist)
{
- struct request *submit_list = NULL;
- struct request *requeue_list = NULL;
- struct request **requeue_lastp = &requeue_list;
+ struct rq_list submit_list = { };
+ struct rq_list requeue_list = { };
struct virtio_blk_vq *vq = NULL;
struct request *req;
@@ -515,9 +514,9 @@ static void virtio_queue_rqs(struct request **rqlist)
vq = this_vq;
if (virtblk_prep_rq_batch(req))
- rq_list_add(&submit_list, req); /* reverse order */
+ rq_list_add_head(&submit_list, req); /* reverse order */
else
- rq_list_add_tail(&requeue_lastp, req);
+ rq_list_add_tail(&requeue_list, req);
}
if (vq)
diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
index b1387dc459a3..7cd1102a8d2c 100644
--- a/drivers/nvme/host/apple.c
+++ b/drivers/nvme/host/apple.c
@@ -649,7 +649,7 @@ static bool apple_nvme_handle_cq(struct apple_nvme_queue *q, bool force)
found = apple_nvme_poll_cq(q, &iob);
- if (!rq_list_empty(iob.req_list))
+ if (!rq_list_empty(&iob.req_list))
apple_nvme_complete_batch(&iob);
return found;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c66ea2c23690..c3a00a13946d 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -902,7 +902,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
return BLK_STS_OK;
}
-static void nvme_submit_cmds(struct nvme_queue *nvmeq, struct request **rqlist)
+static void nvme_submit_cmds(struct nvme_queue *nvmeq, struct rq_list *rqlist)
{
struct request *req;
@@ -930,11 +930,10 @@ static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
return nvme_prep_rq(nvmeq->dev, req) == BLK_STS_OK;
}
-static void nvme_queue_rqs(struct request **rqlist)
+static void nvme_queue_rqs(struct rq_list *rqlist)
{
- struct request *submit_list = NULL;
- struct request *requeue_list = NULL;
- struct request **requeue_lastp = &requeue_list;
+ struct rq_list submit_list = { };
+ struct rq_list requeue_list = { };
struct nvme_queue *nvmeq = NULL;
struct request *req;
@@ -944,9 +943,9 @@ static void nvme_queue_rqs(struct request **rqlist)
nvmeq = req->mq_hctx->driver_data;
if (nvme_prep_rq_batch(nvmeq, req))
- rq_list_add(&submit_list, req); /* reverse order */
+ rq_list_add_head(&submit_list, req); /* reverse order */
else
- rq_list_add_tail(&requeue_lastp, req);
+ rq_list_add_tail(&requeue_list, req);
}
if (nvmeq)
@@ -1078,7 +1077,7 @@ static irqreturn_t nvme_irq(int irq, void *data)
DEFINE_IO_COMP_BATCH(iob);
if (nvme_poll_cq(nvmeq, &iob)) {
- if (!rq_list_empty(iob.req_list))
+ if (!rq_list_empty(&iob.req_list))
nvme_pci_complete_batch(&iob);
return IRQ_HANDLED;
}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index ad26a41d13f9..c61e04365677 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -229,44 +229,60 @@ static inline unsigned short req_get_ioprio(struct request *req)
#define rq_dma_dir(rq) \
(op_is_write(req_op(rq)) ? DMA_TO_DEVICE : DMA_FROM_DEVICE)
-#define rq_list_add(listptr, rq) do { \
- (rq)->rq_next = *(listptr); \
- *(listptr) = rq; \
-} while (0)
-
-#define rq_list_add_tail(lastpptr, rq) do { \
- (rq)->rq_next = NULL; \
- **(lastpptr) = rq; \
- *(lastpptr) = &rq->rq_next; \
-} while (0)
-
-#define rq_list_pop(listptr) \
-({ \
- struct request *__req = NULL; \
- if ((listptr) && *(listptr)) { \
- __req = *(listptr); \
- *(listptr) = __req->rq_next; \
- } \
- __req; \
-})
+static inline int rq_list_empty(const struct rq_list *rl)
+{
+ return rl->head == NULL;
+}
-#define rq_list_peek(listptr) \
-({ \
- struct request *__req = NULL; \
- if ((listptr) && *(listptr)) \
- __req = *(listptr); \
- __req; \
-})
+static inline void rq_list_init(struct rq_list *rl)
+{
+ rl->head = NULL;
+ rl->tail = NULL;
+}
+
+static inline void rq_list_add_tail(struct rq_list *rl, struct request *rq)
+{
+ rq->rq_next = NULL;
+ if (rl->tail)
+ rl->tail->rq_next = rq;
+ else
+ rl->head = rq;
+ rl->tail = rq;
+}
+
+static inline void rq_list_add_head(struct rq_list *rl, struct request *rq)
+{
+ rq->rq_next = rl->head;
+ rl->head = rq;
+ if (!rl->tail)
+ rl->tail = rq;
+}
+
+static inline struct request *rq_list_pop(struct rq_list *rl)
+{
+ struct request *rq = rl->head;
+
+ if (rq) {
+ rl->head = rl->head->rq_next;
+ if (!rl->head)
+ rl->tail = NULL;
+ rq->rq_next = NULL;
+ }
+
+ return rq;
+}
-#define rq_list_for_each(listptr, pos) \
- for (pos = rq_list_peek((listptr)); pos; pos = rq_list_next(pos))
+static inline struct request *rq_list_peek(struct rq_list *rl)
+{
+ return rl->head;
+}
-#define rq_list_for_each_safe(listptr, pos, nxt) \
- for (pos = rq_list_peek((listptr)), nxt = rq_list_next(pos); \
- pos; pos = nxt, nxt = pos ? rq_list_next(pos) : NULL)
+#define rq_list_for_each(rl, pos) \
+ for (pos = rq_list_peek((rl)); (pos); pos = pos->rq_next)
-#define rq_list_next(rq) (rq)->rq_next
-#define rq_list_empty(list) ((list) == (struct request *) NULL)
+#define rq_list_for_each_safe(rl, pos, nxt) \
+ for (pos = rq_list_peek((rl)), nxt = pos->rq_next; \
+ pos; pos = nxt, nxt = pos ? pos->rq_next : NULL)
/**
* enum blk_eh_timer_return - How the timeout handler should proceed
@@ -559,7 +575,7 @@ struct blk_mq_ops {
* empty the @rqlist completely, then the rest will be queued
* individually by the block layer upon return.
*/
- void (*queue_rqs)(struct request **rqlist);
+ void (*queue_rqs)(struct rq_list *rqlist);
/**
* @get_budget: Reserve budget before queue request, once .queue_rq is
@@ -868,7 +884,7 @@ static inline bool blk_mq_add_to_batch(struct request *req,
else if (iob->complete != complete)
return false;
iob->need_ts |= blk_mq_need_time_stamp(req);
- rq_list_add(&iob->req_list, req);
+ rq_list_add_head(&iob->req_list, req);
return true;
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 65f37ae70712..ce8b65503ff0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1006,6 +1006,11 @@ extern void blk_put_queue(struct request_queue *);
void blk_mark_disk_dead(struct gendisk *disk);
#ifdef CONFIG_BLOCK
+struct rq_list {
+ struct request *head;
+ struct request *tail;
+};
+
/*
* blk_plug permits building a queue of related requests by holding the I/O
* fragments for a short period. This allows merging of sequential requests
@@ -1018,10 +1023,10 @@ void blk_mark_disk_dead(struct gendisk *disk);
* blk_flush_plug() is called.
*/
struct blk_plug {
- struct request *mq_list; /* blk-mq requests */
+ struct rq_list mq_list; /* blk-mq requests */
/* if ios_left is > 1, we can batch tag/rq allocations */
- struct request *cached_rq;
+ struct rq_list cached_rqs;
u64 cur_ktime;
unsigned short nr_ios;
@@ -1683,7 +1688,7 @@ int bdev_thaw(struct block_device *bdev);
void bdev_fput(struct file *bdev_file);
struct io_comp_batch {
- struct request *req_list;
+ struct rq_list req_list;
bool need_ts;
void (*complete)(struct io_comp_batch *);
};
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 354c4e175654..9daef985543e 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1160,12 +1160,12 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
poll_flags |= BLK_POLL_ONESHOT;
/* iopoll may have completed current req */
- if (!rq_list_empty(iob.req_list) ||
+ if (!rq_list_empty(&iob.req_list) ||
READ_ONCE(req->iopoll_completed))
break;
}
- if (!rq_list_empty(iob.req_list))
+ if (!rq_list_empty(&iob.req_list))
iob.complete(&iob);
else if (!pos)
return 0;
--
2.45.2
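
For reference, a standalone userspace model of the rq_list helpers added in
blk-mq.h above, using a plain struct node in place of struct request; it
mirrors the head/tail bookkeeping of the new inline functions but is only an
illustration, not the kernel code.

#include <stdio.h>

/* Illustrative userspace model of the new helpers, not kernel code. */
struct node {
	int id;
	struct node *next;
};

struct rq_list {
	struct node *head;
	struct node *tail;
};

static int rq_list_empty(const struct rq_list *rl)
{
	return rl->head == NULL;
}

static void rq_list_add_head(struct rq_list *rl, struct node *n)
{
	n->next = rl->head;
	rl->head = n;
	if (!rl->tail)
		rl->tail = n;
}

static void rq_list_add_tail(struct rq_list *rl, struct node *n)
{
	n->next = NULL;
	if (rl->tail)
		rl->tail->next = n;
	else
		rl->head = n;
	rl->tail = n;
}

static struct node *rq_list_pop(struct rq_list *rl)
{
	struct node *n = rl->head;

	if (n) {
		rl->head = n->next;
		if (!rl->head)
			rl->tail = NULL;
		n->next = NULL;
	}
	return n;
}

int main(void)
{
	struct node a = { .id = 1 }, b = { .id = 2 }, c = { .id = 3 };
	struct rq_list rl = { 0 };

	rq_list_add_tail(&rl, &a);
	rq_list_add_tail(&rl, &b);	/* O(1) thanks to the tail pointer */
	rq_list_add_head(&rl, &c);	/* list is now 3, 1, 2 */

	while (!rq_list_empty(&rl))
		printf("%d ", rq_list_pop(&rl)->id);
	printf("\n");			/* prints: 3 1 2 */
	return 0;
}

The head/tail pair is what makes tail insertion O(1); the later patches in
this series rely on that to keep the plug list and completion batches in
submission order without an extra reversal pass.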
* Re: [PATCH 4/6] block: add a rq_list type
2024-11-13 15:20 ` [PATCH 4/6] block: add a rq_list type Christoph Hellwig
@ 2024-11-14 20:11 ` Nathan Chancellor
2024-11-15 9:10 ` Christoph Hellwig
2024-11-15 12:49 ` Jens Axboe
0 siblings, 2 replies; 26+ messages in thread
From: Nathan Chancellor @ 2024-11-14 20:11 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Michael S. Tsirkin, Jason Wang, Keith Busch,
Sagi Grimberg, Pavel Begunkov, linux-block, virtualization,
linux-nvme, io-uring
Hi Christoph,
On Wed, Nov 13, 2024 at 04:20:44PM +0100, Christoph Hellwig wrote:
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 65f37ae70712..ce8b65503ff0 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1006,6 +1006,11 @@ extern void blk_put_queue(struct request_queue *);
> void blk_mark_disk_dead(struct gendisk *disk);
>
> #ifdef CONFIG_BLOCK
> +struct rq_list {
> + struct request *head;
> + struct request *tail;
> +};
> +
> /*
> * blk_plug permits building a queue of related requests by holding the I/O
> * fragments for a short period. This allows merging of sequential requests
> @@ -1018,10 +1023,10 @@ void blk_mark_disk_dead(struct gendisk *disk);
> * blk_flush_plug() is called.
> */
> struct blk_plug {
> - struct request *mq_list; /* blk-mq requests */
> + struct rq_list mq_list; /* blk-mq requests */
>
> /* if ios_left is > 1, we can batch tag/rq allocations */
> - struct request *cached_rq;
> + struct rq_list cached_rqs;
> u64 cur_ktime;
> unsigned short nr_ios;
>
> @@ -1683,7 +1688,7 @@ int bdev_thaw(struct block_device *bdev);
> void bdev_fput(struct file *bdev_file);
>
> struct io_comp_batch {
> - struct request *req_list;
> + struct rq_list req_list;
This change as commit a3396b99990d ("block: add a rq_list type") in
next-20241114 causes errors when CONFIG_BLOCK is disabled because the
definition of 'struct rq_list' is under CONFIG_BLOCK. Should it be moved
out?
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 00212e96261a..a1fd0ddce5cf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1006,12 +1006,12 @@ extern void blk_put_queue(struct request_queue *);
void blk_mark_disk_dead(struct gendisk *disk);
-#ifdef CONFIG_BLOCK
struct rq_list {
struct request *head;
struct request *tail;
};
+#ifdef CONFIG_BLOCK
/*
* blk_plug permits building a queue of related requests by holding the I/O
* fragments for a short period. This allows merging of sequential requests
> bool need_ts;
> void (*complete)(struct io_comp_batch *);
> };
* Re: [PATCH 4/6] block: add a rq_list type
2024-11-14 20:11 ` Nathan Chancellor
@ 2024-11-15 9:10 ` Christoph Hellwig
2024-11-15 12:49 ` Jens Axboe
1 sibling, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2024-11-15 9:10 UTC (permalink / raw)
To: Nathan Chancellor
Cc: Christoph Hellwig, Jens Axboe, Michael S. Tsirkin, Jason Wang,
Keith Busch, Sagi Grimberg, Pavel Begunkov, linux-block,
virtualization, linux-nvme, io-uring
On Thu, Nov 14, 2024 at 01:11:03PM -0700, Nathan Chancellor wrote:
> This change as commit a3396b99990d ("block: add a rq_list type") in
> next-20241114 causes errors when CONFIG_BLOCK is disabled because the
> definition of 'struct rq_list' is under CONFIG_BLOCK. Should it be moved
> out?
Yes, this looks good:
Reviewed-by: Christoph Hellwig <[email protected]>
* Re: [PATCH 4/6] block: add a rq_list type
2024-11-14 20:11 ` Nathan Chancellor
2024-11-15 9:10 ` Christoph Hellwig
@ 2024-11-15 12:49 ` Jens Axboe
2024-11-15 19:38 ` Jens Axboe
1 sibling, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2024-11-15 12:49 UTC (permalink / raw)
To: Nathan Chancellor, Christoph Hellwig
Cc: Michael S. Tsirkin, Jason Wang, Keith Busch, Sagi Grimberg,
Pavel Begunkov, linux-block, virtualization, linux-nvme, io-uring
On 11/14/24 1:11 PM, Nathan Chancellor wrote:
> Hi Christoph,
>
> On Wed, Nov 13, 2024 at 04:20:44PM +0100, Christoph Hellwig wrote:
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 65f37ae70712..ce8b65503ff0 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -1006,6 +1006,11 @@ extern void blk_put_queue(struct request_queue *);
>> void blk_mark_disk_dead(struct gendisk *disk);
>>
>> #ifdef CONFIG_BLOCK
>> +struct rq_list {
>> + struct request *head;
>> + struct request *tail;
>> +};
>> +
>> /*
>> * blk_plug permits building a queue of related requests by holding the I/O
>> * fragments for a short period. This allows merging of sequential requests
>> @@ -1018,10 +1023,10 @@ void blk_mark_disk_dead(struct gendisk *disk);
>> * blk_flush_plug() is called.
>> */
>> struct blk_plug {
>> - struct request *mq_list; /* blk-mq requests */
>> + struct rq_list mq_list; /* blk-mq requests */
>>
>> /* if ios_left is > 1, we can batch tag/rq allocations */
>> - struct request *cached_rq;
>> + struct rq_list cached_rqs;
>> u64 cur_ktime;
>> unsigned short nr_ios;
>>
>> @@ -1683,7 +1688,7 @@ int bdev_thaw(struct block_device *bdev);
>> void bdev_fput(struct file *bdev_file);
>>
>> struct io_comp_batch {
>> - struct request *req_list;
>> + struct rq_list req_list;
>
> This change as commit a3396b99990d ("block: add a rq_list type") in
> next-20241114 causes errors when CONFIG_BLOCK is disabled because the
> definition of 'struct rq_list' is under CONFIG_BLOCK. Should it be moved
> out?
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 00212e96261a..a1fd0ddce5cf 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1006,12 +1006,12 @@ extern void blk_put_queue(struct request_queue *);
>
> void blk_mark_disk_dead(struct gendisk *disk);
>
> -#ifdef CONFIG_BLOCK
> struct rq_list {
> struct request *head;
> struct request *tail;
> };
>
> +#ifdef CONFIG_BLOCK
> /*
> * blk_plug permits building a queue of related requests by holding the I/O
> * fragments for a short period. This allows merging of sequential requests
>
Fix looks fine, but I can't apply a patch that hasn't been signed off.
Please send one, or I'll just have to sort it out manually as we're
really close to this code shipping.
--
Jens Axboe
* Re: [PATCH 4/6] block: add a rq_list type
2024-11-15 12:49 ` Jens Axboe
@ 2024-11-15 19:38 ` Jens Axboe
2024-11-16 0:56 ` Nathan Chancellor
0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2024-11-15 19:38 UTC (permalink / raw)
To: Nathan Chancellor, Christoph Hellwig
Cc: Michael S. Tsirkin, Jason Wang, Keith Busch, Sagi Grimberg,
Pavel Begunkov, linux-block, virtualization, linux-nvme, io-uring
On 11/15/24 5:49 AM, Jens Axboe wrote:
> On 11/14/24 1:11 PM, Nathan Chancellor wrote:
>> Hi Christoph,
>>
>> On Wed, Nov 13, 2024 at 04:20:44PM +0100, Christoph Hellwig wrote:
>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>> index 65f37ae70712..ce8b65503ff0 100644
>>> --- a/include/linux/blkdev.h
>>> +++ b/include/linux/blkdev.h
>>> @@ -1006,6 +1006,11 @@ extern void blk_put_queue(struct request_queue *);
>>> void blk_mark_disk_dead(struct gendisk *disk);
>>>
>>> #ifdef CONFIG_BLOCK
>>> +struct rq_list {
>>> + struct request *head;
>>> + struct request *tail;
>>> +};
>>> +
>>> /*
>>> * blk_plug permits building a queue of related requests by holding the I/O
>>> * fragments for a short period. This allows merging of sequential requests
>>> @@ -1018,10 +1023,10 @@ void blk_mark_disk_dead(struct gendisk *disk);
>>> * blk_flush_plug() is called.
>>> */
>>> struct blk_plug {
>>> - struct request *mq_list; /* blk-mq requests */
>>> + struct rq_list mq_list; /* blk-mq requests */
>>>
>>> /* if ios_left is > 1, we can batch tag/rq allocations */
>>> - struct request *cached_rq;
>>> + struct rq_list cached_rqs;
>>> u64 cur_ktime;
>>> unsigned short nr_ios;
>>>
>>> @@ -1683,7 +1688,7 @@ int bdev_thaw(struct block_device *bdev);
>>> void bdev_fput(struct file *bdev_file);
>>>
>>> struct io_comp_batch {
>>> - struct request *req_list;
>>> + struct rq_list req_list;
>>
>> This change as commit a3396b99990d ("block: add a rq_list type") in
>> next-20241114 causes errors when CONFIG_BLOCK is disabled because the
>> definition of 'struct rq_list' is under CONFIG_BLOCK. Should it be moved
>> out?
>>
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 00212e96261a..a1fd0ddce5cf 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -1006,12 +1006,12 @@ extern void blk_put_queue(struct request_queue *);
>>
>> void blk_mark_disk_dead(struct gendisk *disk);
>>
>> -#ifdef CONFIG_BLOCK
>> struct rq_list {
>> struct request *head;
>> struct request *tail;
>> };
>>
>> +#ifdef CONFIG_BLOCK
>> /*
>> * blk_plug permits building a queue of related requests by holding the I/O
>> * fragments for a short period. This allows merging of sequential requests
>>
>
> Fix looks fine, but I can't apply a patch that hasn't been signed off.
> Please send one, or I'll just have to sort it out manually as we're
> really close to this code shipping.
I fixed it up myself, it's too close to me sending out 6.13 changes
to let it linger:
https://git.kernel.dk/cgit/linux/commit/?h=for-6.13/block&id=957860cbc1dc89f79f2acc193470224e350dfd03
--
Jens Axboe
* Re: [PATCH 4/6] block: add a rq_list type
2024-11-15 19:38 ` Jens Axboe
@ 2024-11-16 0:56 ` Nathan Chancellor
0 siblings, 0 replies; 26+ messages in thread
From: Nathan Chancellor @ 2024-11-16 0:56 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Michael S. Tsirkin, Jason Wang, Keith Busch,
Sagi Grimberg, Pavel Begunkov, linux-block, virtualization,
linux-nvme, io-uring
On Fri, Nov 15, 2024 at 12:38:46PM -0700, Jens Axboe wrote:
> On 11/15/24 5:49 AM, Jens Axboe wrote:
> > Fix looks fine, but I can't apply a patch that hasn't been signed off.
> > Please send one, or I'll just have to sort it out manually as we're
> > really close to this code shipping.
>
> I fixed it up myself, it's too close to me sending out 6.13 changes
> to let it linger:
>
> https://git.kernel.dk/cgit/linux/commit/?h=for-6.13/block&id=957860cbc1dc89f79f2acc193470224e350dfd03
Thanks, I was going to send a patch amidst my travels today but by the
time I got to it, I saw you had already tackled it on your branch.
Cheers,
Nathan
* [PATCH 5/6] block: don't reorder requests in blk_add_rq_to_plug
2024-11-13 15:20 don't reorder requests passed to ->queue_rqs Christoph Hellwig
` (3 preceding siblings ...)
2024-11-13 15:20 ` [PATCH 4/6] block: add a rq_list type Christoph Hellwig
@ 2024-11-13 15:20 ` Christoph Hellwig
2024-11-13 15:20 ` [PATCH 6/6] block: don't reorder requests in blk_mq_add_to_batch Christoph Hellwig
` (3 subsequent siblings)
8 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2024-11-13 15:20 UTC (permalink / raw)
To: Jens Axboe
Cc: Michael S. Tsirkin, Jason Wang, Keith Busch, Sagi Grimberg,
Pavel Begunkov, linux-block, virtualization, linux-nvme, io-uring
Add requests to the tail of the list instead of the front so that they
are queued up in submission order.
Remove the re-reordering in blk_mq_dispatch_plug_list, virtio_queue_rqs
and nvme_queue_rqs now that the list is ordered as expected.
Signed-off-by: Christoph Hellwig <[email protected]>
---
block/blk-mq.c | 4 ++--
drivers/block/virtio_blk.c | 2 +-
drivers/nvme/host/pci.c | 2 +-
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ff0b819e35fc..270cfd9fc6b0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1392,7 +1392,7 @@ static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
*/
if (!plug->has_elevator && (rq->rq_flags & RQF_SCHED_TAGS))
plug->has_elevator = true;
- rq_list_add_head(&plug->mq_list, rq);
+ rq_list_add_tail(&plug->mq_list, rq);
plug->rq_count++;
}
@@ -2844,7 +2844,7 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
rq_list_add_tail(&requeue_list, rq);
continue;
}
- list_add(&rq->queuelist, &list);
+ list_add_tail(&rq->queuelist, &list);
depth++;
} while (!rq_list_empty(&plug->mq_list));
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index a19f24c19140..c0cdba71f436 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -514,7 +514,7 @@ static void virtio_queue_rqs(struct rq_list *rqlist)
vq = this_vq;
if (virtblk_prep_rq_batch(req))
- rq_list_add_head(&submit_list, req); /* reverse order */
+ rq_list_add_tail(&submit_list, req);
else
rq_list_add_tail(&requeue_list, req);
}
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c3a00a13946d..39ba1d0c287d 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -943,7 +943,7 @@ static void nvme_queue_rqs(struct rq_list *rqlist)
nvmeq = req->mq_hctx->driver_data;
if (nvme_prep_rq_batch(nvmeq, req))
- rq_list_add_head(&submit_list, req); /* reverse order */
+ rq_list_add_tail(&submit_list, req);
else
rq_list_add_tail(&requeue_list, req);
}
--
2.45.2
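
The ordering difference this patch relies on can be shown with a trivial,
purely illustrative sketch: head insertion (what blk_add_rq_to_plug did
before) hands the list over in LIFO order, while tail insertion preserves
FIFO submission order, which is why the driver-side re-reversal can be
dropped.

#include <stdio.h>

#define NR 4

int main(void)
{
	int head_insert[NR], tail_insert[NR], i;

	for (i = 0; i < NR; i++) {
		head_insert[NR - 1 - i] = i + 1;	/* new entry goes to the front */
		tail_insert[i] = i + 1;			/* new entry goes to the back */
	}

	printf("head insertion (old plug behaviour): ");
	for (i = 0; i < NR; i++)
		printf("%d ", head_insert[i]);		/* 4 3 2 1 */

	printf("\ntail insertion (after this patch):  ");
	for (i = 0; i < NR; i++)
		printf("%d ", tail_insert[i]);		/* 1 2 3 4 */
	printf("\n");
	return 0;
}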
* [PATCH 6/6] block: don't reorder requests in blk_mq_add_to_batch
2024-11-13 15:20 don't reorder requests passed to ->queue_rqs Christoph Hellwig
` (4 preceding siblings ...)
2024-11-13 15:20 ` [PATCH 5/6] block: don't reorder requests in blk_add_rq_to_plug Christoph Hellwig
@ 2024-11-13 15:20 ` Christoph Hellwig
2024-11-13 18:33 ` don't reorder requests passed to ->queue_rqs Bart Van Assche
` (2 subsequent siblings)
8 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2024-11-13 15:20 UTC (permalink / raw)
To: Jens Axboe
Cc: Michael S. Tsirkin, Jason Wang, Keith Busch, Sagi Grimberg,
Pavel Begunkov, linux-block, virtualization, linux-nvme, io-uring
LIFO ordering for batched completions is a bit unexpected and also
defeats some merging optimizations in e.g. the XFS buffered write
code. Now that we can easily add the request to the tail of the list,
do that.
Signed-off-by: Christoph Hellwig <[email protected]>
---
include/linux/blk-mq.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index c61e04365677..c596e0e4cb75 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -884,7 +884,7 @@ static inline bool blk_mq_add_to_batch(struct request *req,
else if (iob->complete != complete)
return false;
iob->need_ts |= blk_mq_need_time_stamp(req);
- rq_list_add_head(&iob->req_list, req);
+ rq_list_add_tail(&iob->req_list, req);
return true;
}
--
2.45.2
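
A small, self-contained userspace sketch of the completion-batching shape this
patch affects, assuming an array-backed batch in place of the kernel's
io_comp_batch; with tail insertion the batch is completed in the order the
requests finished rather than reversed. All names below are illustrative.

#include <stdio.h>

/* Illustrative userspace model, not the kernel io_comp_batch. */
#define BATCH_MAX 8

struct comp_batch {
	int ids[BATCH_MAX];
	int nr;
	void (*complete)(struct comp_batch *);
};

static void complete_batch(struct comp_batch *b)
{
	int i;

	for (i = 0; i < b->nr; i++)
		printf("completing request %d\n", b->ids[i]);
	b->nr = 0;
}

/* Returns 0 if the batch is full; the caller would then complete the
 * request individually instead of batching it. */
static int add_to_batch(struct comp_batch *b, int id)
{
	if (b->nr == BATCH_MAX)
		return 0;
	b->ids[b->nr++] = id;	/* tail append keeps completion order */
	return 1;
}

int main(void)
{
	struct comp_batch iob = { .complete = complete_batch };
	int id;

	for (id = 1; id <= 4; id++)
		add_to_batch(&iob, id);
	iob.complete(&iob);	/* completes 1, 2, 3, 4 in order */
	return 0;
}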
* Re: don't reorder requests passed to ->queue_rqs
2024-11-13 15:20 don't reorder requests passed to ->queue_rqs Christoph Hellwig
` (5 preceding siblings ...)
2024-11-13 15:20 ` [PATCH 6/6] block: don't reorder requests in blk_mq_add_to_batch Christoph Hellwig
@ 2024-11-13 18:33 ` Bart Van Assche
2024-11-13 18:39 ` Jens Axboe
2024-11-13 18:46 ` Jens Axboe
2024-11-13 20:36 ` Chaitanya Kulkarni
8 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2024-11-13 18:33 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe
Cc: Michael S. Tsirkin, Jason Wang, Keith Busch, Sagi Grimberg,
Pavel Begunkov, linux-block, virtualization, linux-nvme, io-uring
On 11/13/24 7:20 AM, Christoph Hellwig wrote:
> currently blk-mq reorders requests when adding them to the plug because
> the request list can't do efficient tail appends. When the plug is
> directly issued using ->queue_rqs that means reordered requests are
> passed to the driver, which can lead to very bad I/O patterns when
> not corrected, especially on rotational devices (e.g. NVMe HDD) or
> when using zone append.
>
> This series first adds two easily backportable workarounds to reverse
> the reordering in the virtio_blk and nvme-pci ->queue_rq implementations
> similar to what the non-queue_rqs path does, and then adds a rq_list
> type that allows for efficient tail insertions and uses that to fix
> the reordering for real and then does the same for I/O completions as
> well.
Hi Christoph,
Could something like the patch below replace this patch series? I don't
have a strong opinion about which approach to select.
I'm sharing this patch because this is what I came up with while looking into
how to support QD>1 for zoned devices with a storage controller that
preserves the request order (UFS).
Thanks,
Bart.
block: Make the plugging mechanism preserve the request order
Requests are added to the front of the plug list and dispatching happens
in list order. Hence, dispatching happens in reverse order. Dispatch in
order by reversing the plug list before dispatching. This patch is a
modified version of a patch from Jens
(https://lore.kernel.org/linux-block/[email protected]/).
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3533bd808072..bf2ea421b2e8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2868,6 +2868,22 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
percpu_ref_put(&this_hctx->queue->q_usage_counter);
}
+/* See also llist_reverse_order(). */
+static void blk_plug_reverse_order(struct blk_plug *plug)
+{
+ struct request *rq = plug->mq_list, *new_head = NULL;
+
+ while (rq) {
+ struct request *tmp = rq;
+
+ rq = rq->rq_next;
+ tmp->rq_next = new_head;
+ new_head = tmp;
+ }
+
+ plug->mq_list = new_head;
+}
+
void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
{
struct request *rq;
@@ -2885,6 +2901,8 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
depth = plug->rq_count;
plug->rq_count = 0;
+ blk_plug_reverse_order(plug);
+
if (!plug->multiple_queues && !plug->has_elevator && !from_schedule) {
struct request_queue *q;
* Re: don't reorder requests passed to ->queue_rqs
2024-11-13 18:33 ` don't reorder requests passed to ->queue_rqs Bart Van Assche
@ 2024-11-13 18:39 ` Jens Axboe
0 siblings, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2024-11-13 18:39 UTC (permalink / raw)
To: Bart Van Assche, Christoph Hellwig
Cc: Michael S. Tsirkin, Jason Wang, Keith Busch, Sagi Grimberg,
Pavel Begunkov, linux-block, virtualization, linux-nvme, io-uring
On 11/13/24 11:33 AM, Bart Van Assche wrote:
>
> On 11/13/24 7:20 AM, Christoph Hellwig wrote:
>> currently blk-mq reorders requests when adding them to the plug because
>> the request list can't do efficient tail appends. When the plug is
>> directly issued using ->queue_rqs that means reordered requests are
>> passed to the driver, which can lead to very bad I/O patterns when
>> not corrected, especially on rotational devices (e.g. NVMe HDD) or
>> when using zone append.
>>
>> This series first adds two easily backportable workarounds to reverse
>> the reording in the virtio_blk and nvme-pci ->queue_rq implementations
>> similar to what the non-queue_rqs path does, and then adds a rq_list
>> type that allows for efficient tail insertions and uses that to fix
>> the reordering for real and then does the same for I/O completions as
>> well.
>
> Hi Christoph,
>
> Could something like the patch below replace this patch series? I
> don't have a strong opinion about which approach to select.
I mean it obviously could, but it'd be a terrible way to go as we're now
iterating the full list just to reverse it...
--
Jens Axboe
* Re: don't reorder requests passed to ->queue_rqs
2024-11-13 15:20 don't reorder requests passed to ->queue_rqs Christoph Hellwig
` (6 preceding siblings ...)
2024-11-13 18:33 ` don't reorder requests passed to ->queue_rqs Bart Van Assche
@ 2024-11-13 18:46 ` Jens Axboe
2024-11-13 20:36 ` Chaitanya Kulkarni
8 siblings, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2024-11-13 18:46 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Michael S. Tsirkin, Jason Wang, Keith Busch, Sagi Grimberg,
Pavel Begunkov, linux-block, virtualization, linux-nvme, io-uring
On Wed, 13 Nov 2024 16:20:40 +0100, Christoph Hellwig wrote:
> currently blk-mq reorders requests when adding them to the plug because
> the request list can't do efficient tail appends. When the plug is
> directly issued using ->queue_rqs that means reordered requests are
> passed to the driver, which can lead to very bad I/O patterns when
> not corrected, especially on rotational devices (e.g. NVMe HDD) or
> when using zone append.
>
> [...]
Applied, thanks!
[1/6] nvme-pci: reverse request order in nvme_queue_rqs
commit: beadf0088501d9dcf2454b05d90d5d31ea3ba55f
[2/6] virtio_blk: reverse request order in virtio_queue_rqs
commit: a982a5637e91fd4491da4bb8eb79cde3fed2300e
[3/6] block: remove rq_list_move
commit: dd53d238d4ca7ff1a0e10eb0205e842560f1394d
[4/6] block: add a rq_list type
commit: dc56e2a78ba20c472c419c6f2ed8fb7d0c95ad90
[5/6] block: don't reorder requests in blk_add_rq_to_plug
commit: 05634083a304eb3d4b5733fe8d6e2fd0c1041ca8
[6/6] block: don't reorder requests in blk_mq_add_to_batch
commit: 65550dea43815d01f45d7ece38e458855d5e2962
Best regards,
--
Jens Axboe
* Re: don't reorder requests passed to ->queue_rqs
2024-11-13 15:20 don't reorder requests passed to ->queue_rqs Christoph Hellwig
` (7 preceding siblings ...)
2024-11-13 18:46 ` Jens Axboe
@ 2024-11-13 20:36 ` Chaitanya Kulkarni
2024-11-13 20:51 ` Jens Axboe
8 siblings, 1 reply; 26+ messages in thread
From: Chaitanya Kulkarni @ 2024-11-13 20:36 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe
Cc: Michael S. Tsirkin, Jason Wang, Keith Busch, Sagi Grimberg,
Pavel Begunkov, [email protected],
[email protected], [email protected],
[email protected]
On 11/13/24 07:20, Christoph Hellwig wrote:
> Hi Jens,
>
> currently blk-mq reorders requests when adding them to the plug because
> the request list can't do efficient tail appends. When the plug is
> directly issued using ->queue_rqs that means reordered requests are
> passed to the driver, which can lead to very bad I/O patterns when
> not corrected, especially on rotational devices (e.g. NVMe HDD) or
> when using zone append.
>
> This series first adds two easily backportable workarounds to reverse
> the reording in the virtio_blk and nvme-pci ->queue_rq implementations
> similar to what the non-queue_rqs path does, and then adds a rq_list
> type that allows for efficient tail insertions and uses that to fix
> the reordering for real and then does the same for I/O completions as
> well.
Looks good to me. I ran the quick performance numbers [1].
Reviewed-by: Chaitanya Kulkarni <[email protected]>
-ck
fio randread iouring workload :-
IOPS :-
-------
nvme-orig: Average IOPS: 72,690
nvme-new-no-reorder: Average IOPS: 72,580
BW :-
-------
nvme-orig: Average BW: 283.9 MiB/s
nvme-new-no-reorder: Average BW: 283.4 MiB/s
IOPS/BW :-
nvme-orig-10.fio: read: IOPS=72.9k, BW=285MiB/s (299MB/s)(16.7GiB/60004msec)
nvme-orig-1.fio: read: IOPS=72.7k, BW=284MiB/s (298MB/s)(16.6GiB/60004msec)
nvme-orig-2.fio: read: IOPS=73.0k, BW=285MiB/s (299MB/s)(16.7GiB/60004msec)
nvme-orig-3.fio: read: IOPS=73.3k, BW=286MiB/s (300MB/s)(16.8GiB/60003msec)
nvme-orig-4.fio: read: IOPS=72.5k, BW=283MiB/s (297MB/s)(16.6GiB/60003msec)
nvme-orig-5.fio: read: IOPS=72.4k, BW=283MiB/s (297MB/s)(16.6GiB/60004msec)
nvme-orig-6.fio: read: IOPS=72.9k, BW=285MiB/s (299MB/s)(16.7GiB/60003msec)
nvme-orig-7.fio: read: IOPS=72.3k, BW=282MiB/s (296MB/s)(16.5GiB/60004msec)
nvme-orig-8.fio: read: IOPS=72.4k, BW=283MiB/s (296MB/s)(16.6GiB/60003msec)
nvme-orig-9.fio: read: IOPS=72.5k, BW=283MiB/s (297MB/s)(16.6GiB/60004msec)
nvme (nvme-6.13) #
nvme (nvme-6.13) # grep BW nvme-new-no-reorder-*fio
nvme-new-no-reorder-10.fio: read: IOPS=72.5k, BW=283MiB/s (297MB/s)(16.6GiB/60004msec)
nvme-new-no-reorder-1.fio: read: IOPS=72.5k, BW=283MiB/s (297MB/s)(16.6GiB/60004msec)
nvme-new-no-reorder-2.fio: read: IOPS=72.5k, BW=283MiB/s (297MB/s)(16.6GiB/60003msec)
nvme-new-no-reorder-3.fio: read: IOPS=71.7k, BW=280MiB/s (294MB/s)(16.4GiB/60005msec)
nvme-new-no-reorder-4.fio: read: IOPS=72.5k, BW=283MiB/s (297MB/s)(16.6GiB/60004msec)
nvme-new-no-reorder-5.fio: read: IOPS=72.6k, BW=284MiB/s (298MB/s)(16.6GiB/60003msec)
nvme-new-no-reorder-6.fio: read: IOPS=73.3k, BW=286MiB/s (300MB/s)(16.8GiB/60003msec)
nvme-new-no-reorder-7.fio: read: IOPS=72.8k, BW=284MiB/s (298MB/s)(16.7GiB/60003msec)
nvme-new-no-reorder-8.fio: read: IOPS=73.2k, BW=286MiB/s (300MB/s)(16.7GiB/60004msec)
nvme-new-no-reorder-9.fio: read: IOPS=72.2k, BW=282MiB/s (296MB/s)(16.5GiB/60005msec)
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: don't reorder requests passed to ->queue_rqs
2024-11-13 20:36 ` Chaitanya Kulkarni
@ 2024-11-13 20:51 ` Jens Axboe
2024-11-13 22:23 ` Chaitanya Kulkarni
2024-11-14 4:16 ` Christoph Hellwig
0 siblings, 2 replies; 26+ messages in thread
From: Jens Axboe @ 2024-11-13 20:51 UTC (permalink / raw)
To: Chaitanya Kulkarni, Christoph Hellwig
Cc: Michael S. Tsirkin, Jason Wang, Keith Busch, Sagi Grimberg,
Pavel Begunkov, [email protected],
[email protected], [email protected],
[email protected]
On 11/13/24 1:36 PM, Chaitanya Kulkarni wrote:
> On 11/13/24 07:20, Christoph Hellwig wrote:
>> Hi Jens,
>>
>> currently blk-mq reorders requests when adding them to the plug because
>> the request list can't do efficient tail appends. When the plug is
>> directly issued using ->queue_rqs that means reordered requests are
>> passed to the driver, which can lead to very bad I/O patterns when
>> not corrected, especially on rotational devices (e.g. NVMe HDD) or
>> when using zone append.
>>
>> This series first adds two easily backportable workarounds to reverse
>> the reordering in the virtio_blk and nvme-pci ->queue_rqs implementations
>> similar to what the non-queue_rqs path does, and then adds a rq_list
>> type that allows for efficient tail insertions and uses that to fix
>> the reordering for real and then does the same for I/O completions as
>> well.
>
> Looks good to me. I ran the quick performance numbers [1].
>
> Reviewed-by: Chaitanya Kulkarni <[email protected]>
>
> -ck
>
> fio randread iouring workload :-
>
> IOPS :-
> -------
> nvme-orig: Average IOPS: 72,690
> nvme-new-no-reorder: Average IOPS: 72,580
>
> BW :-
> -------
> nvme-orig: Average BW: 283.9 MiB/s
> nvme-new-no-reorder: Average BW: 283.4 MiB/s
Thanks for testing, but you can't verify any kind of perf change with
that kind of setup. I'd be willing to bet it'll be a 1-2% drop at
higher rates, which is substantial. But the reordering is a problem, not
just for zoned devices, which is why I chose to merge this.
--
Jens Axboe
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: don't reorder requests passed to ->queue_rqs
2024-11-13 20:51 ` Jens Axboe
@ 2024-11-13 22:23 ` Chaitanya Kulkarni
2024-11-13 22:27 ` Jens Axboe
2024-11-14 4:16 ` Christoph Hellwig
1 sibling, 1 reply; 26+ messages in thread
From: Chaitanya Kulkarni @ 2024-11-13 22:23 UTC (permalink / raw)
To: Jens Axboe, Christoph Hellwig
Cc: Michael S. Tsirkin, Jason Wang, Keith Busch, Sagi Grimberg,
Pavel Begunkov, [email protected],
[email protected], [email protected],
[email protected]
On 11/13/2024 12:51 PM, Jens Axboe wrote:
>> Looks good to me. I ran the quick performance numbers [1].
>>
>> Reviewed-by: Chaitanya Kulkarni<[email protected]>
>>
>> -ck
>>
>> fio randread iouring workload :-
>>
>> IOPS :-
>> -------
>> nvme-orig: Average IOPS: 72,690
>> nvme-new-no-reorder: Average IOPS: 72,580
>>
>> BW :-
>> -------
>> nvme-orig: Average BW: 283.9 MiB/s
>> nvme-new-no-reorder: Average BW: 283.4 MiB/s
> Thanks for testing, but you can't verify any kind of perf change with
> that kind of setup. I'd be willing to bet it'll be a 1-2% drop at
> higher rates, which is substantial. But the reordering is a problem, not
> just for zoned devices, which is why I chose to merge this.
>
> -- Jens Axboe
Agreed. My intention was only to test it: since it touches NVMe, I
thought sharing the results might help either way with io_uring. No
intention to stop this patchset or to argue against it over a
potential drop :).
-ck
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: don't reorder requests passed to ->queue_rqs
2024-11-13 22:23 ` Chaitanya Kulkarni
@ 2024-11-13 22:27 ` Jens Axboe
0 siblings, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2024-11-13 22:27 UTC (permalink / raw)
To: Chaitanya Kulkarni, Christoph Hellwig
Cc: Michael S. Tsirkin, Jason Wang, Keith Busch, Sagi Grimberg,
Pavel Begunkov, [email protected],
[email protected], [email protected],
[email protected]
On 11/13/24 3:23 PM, Chaitanya Kulkarni wrote:
> On 11/13/2024 12:51 PM, Jens Axboe wrote:
>>> Looks good to me. I ran the quick performance numbers [1].
>>>
>>> Reviewed-by: Chaitanya Kulkarni<[email protected]>
>>>
>>> -ck
>>>
>>> fio randread iouring workload :-
>>>
>>> IOPS :-
>>> -------
>>> nvme-orig: Average IOPS: 72,690
>>> nvme-new-no-reorder: Average IOPS: 72,580
>>>
>>> BW :-
>>> -------
>>> nvme-orig: Average BW: 283.9 MiB/s
>>> nvme-new-no-reorder: Average BW: 283.4 MiB/s
>> Thanks for testing, but you can't verify any kind of perf change with
>> that kind of setup. I'd be willing to bet it'll be a 1-2% drop at
>> higher rates, which is substantial. But the reordering is a problem, not
>> just for zoned devices, which is why I chose to merge this.
>>
>> -- Jens Axboe
>
> Agreed. My intention was only to test it: since it touches NVMe, I
> thought sharing the results might help either way with io_uring. No
> intention to stop this patchset or to argue against it over a
> potential drop :).
Oh all good, and like I said, the testing is appreciated! The functional
testing is definitely useful.
--
Jens Axboe
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: don't reorder requests passed to ->queue_rqs
2024-11-13 20:51 ` Jens Axboe
2024-11-13 22:23 ` Chaitanya Kulkarni
@ 2024-11-14 4:16 ` Christoph Hellwig
2024-11-14 15:14 ` Jens Axboe
1 sibling, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2024-11-14 4:16 UTC (permalink / raw)
To: Jens Axboe
Cc: Chaitanya Kulkarni, Christoph Hellwig, Michael S. Tsirkin,
Jason Wang, Keith Busch, Sagi Grimberg, Pavel Begunkov,
[email protected], [email protected],
[email protected], [email protected]
On Wed, Nov 13, 2024 at 01:51:48PM -0700, Jens Axboe wrote:
> Thanks for testing, but you can't verify any kind of perf change with
> that kind of setup. I'd be willing to bet it'll be a 1-2% drop at
> higher rates, which is substantial. But the reordering is a problem, not
> just for zoned devices, which is why I chose to merge this.
So I did not see any variation, but I also don't have access to a really
beefy setup right now. If there is a degradation it probably is that
touching rq_next for each request actually has an effect if the list is
big enough and they aren't cache hot any more. I can cook up a patch
that goes back to the scheme currently used upstream in nvme and virtio
that just cuts off the list at a hctx change instead of moving the
requests one by one now that the block layer doesn't mess up the order.
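To make the cut-at-a-hctx-change idea concrete, a rough sketch could look
like the following. The function name is made up; it assumes the head/tail
rq_list layout from the series and the current
nvme_submit_cmds(nvmeq, rqlist) signature, and it leaves out prep-failure
handling and requeueing for brevity, so it is a sketch of the walk, not the
actual patch.

static void nvme_queue_rqs_chunked(struct rq_list *rqlist)
{
        struct request *req;

        while ((req = rq_list_peek(rqlist))) {
                struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
                struct rq_list chunk = { .head = req };

                /* Walk forward until the hw queue changes or the list ends. */
                while (req->rq_next && req->rq_next->mq_hctx == req->mq_hctx)
                        req = req->rq_next;

                /*
                 * Detach the per-hctx chunk; only the boundary request's
                 * rq_next is written.  rqlist->tail is not maintained here
                 * since only head/rq_next are consulted.
                 */
                rqlist->head = req->rq_next;
                req->rq_next = NULL;
                chunk.tail = req;

                nvme_submit_cmds(nvmeq, &chunk);
        }
}

The point of the scheme is that a whole run of requests for the same hw
queue is handed over in one go instead of re-linking every request into a
local submit list.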
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: don't reorder requests passed to ->queue_rqs
2024-11-14 4:16 ` Christoph Hellwig
@ 2024-11-14 15:14 ` Jens Axboe
2024-11-18 15:23 ` Christoph Hellwig
0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2024-11-14 15:14 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chaitanya Kulkarni, Michael S. Tsirkin, Jason Wang, Keith Busch,
Sagi Grimberg, Pavel Begunkov, [email protected],
[email protected], [email protected],
[email protected]
On 11/13/24 9:16 PM, Christoph Hellwig wrote:
> On Wed, Nov 13, 2024 at 01:51:48PM -0700, Jens Axboe wrote:
>> Thanks for testing, but you can't verify any kind of perf change with
>> that kind of setup. I'd be willing to bet it'll be a 1-2% drop at
>> higher rates, which is substantial. But the reordering is a problem, not
>> just for zoned devices, which is why I chose to merge this.
>
> So I did not see any variation, but I also don't have access to a really
> beefy setup right now. If there is a degradation it probably is that
> touching rq_next for each request actually has an effect if the list is
> big enough and they aren't cache hot any more. I can cook up a patch
Exactly.
> that goes back to the scheme currently used upstream in nvme and virtio
> that just cuts off the list at a hctx change instead of moving the
> requests one by one now that the block layer doesn't mess up the order.
I think that would be useful. I can test.
--
Jens Axboe
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: don't reorder requests passed to ->queue_rqs
2024-11-14 15:14 ` Jens Axboe
@ 2024-11-18 15:23 ` Christoph Hellwig
0 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2024-11-18 15:23 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Chaitanya Kulkarni, Michael S. Tsirkin,
Jason Wang, Keith Busch, Sagi Grimberg, Pavel Begunkov,
[email protected], [email protected],
[email protected], [email protected]
On Thu, Nov 14, 2024 at 08:14:11AM -0700, Jens Axboe wrote:
> > that goes back to the scheme currently used upstream in nvme and virtio
> > that just cuts off the list at a hctx change instead of moving the
> > requests one by one now that the block layer doesn't mess up the order.
>
> I think that would be useful. I can test.
I played with this a bit, and I really hate keeping prev for the
unlinking. So this version uses a slightly different approach:
- a first pass that just rolls through all requests and does the prep
checks. If they fail, that is recorded in a new field in the iod
(could also become a request flag)
- the submission loop inside sq_lock becomes a bit more complex, as it
handles breaking out of the inner loop on hctx changes and moving the
previously failed commands to the requeue list.
But this completely gets rid of the prev tracking and all list
manipulation except for the final list_pop.
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5f2e3ad2cc52..3470dae73a8c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -233,6 +233,7 @@ struct nvme_iod {
struct nvme_request req;
struct nvme_command cmd;
bool aborted;
+ bool batch_failed;
s8 nr_allocations; /* PRP list pool allocations. 0 means small
pool in use */
unsigned int dma_len; /* length of single DMA segment mapping */
@@ -843,6 +844,7 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
blk_status_t ret;
iod->aborted = false;
+ iod->batch_failed = false;
iod->nr_allocations = -1;
iod->sgt.nents = 0;
@@ -904,54 +906,51 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
return BLK_STS_OK;
}
-static void nvme_submit_cmds(struct nvme_queue *nvmeq, struct rq_list *rqlist)
+static void nvme_submit_cmds(struct rq_list *rqlist,
+ struct rq_list *requeue_list)
{
- struct request *req;
+ struct request *req = rq_list_peek(rqlist);
+ struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
spin_lock(&nvmeq->sq_lock);
while ((req = rq_list_pop(rqlist))) {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- nvme_sq_copy_cmd(nvmeq, &iod->cmd);
+ if (iod->batch_failed)
+ rq_list_add_tail(requeue_list, req);
+ else
+ nvme_sq_copy_cmd(nvmeq, &iod->cmd);
+
+ if (!req->rq_next || req->mq_hctx != req->rq_next->mq_hctx)
+ break;
}
nvme_write_sq_db(nvmeq, true);
spin_unlock(&nvmeq->sq_lock);
}
-static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
-{
- /*
- * We should not need to do this, but we're still using this to
- * ensure we can drain requests on a dying queue.
- */
- if (unlikely(!test_bit(NVMEQ_ENABLED, &nvmeq->flags)))
- return false;
- if (unlikely(!nvme_check_ready(&nvmeq->dev->ctrl, req, true)))
- return false;
-
- return nvme_prep_rq(nvmeq->dev, req) == BLK_STS_OK;
-}
-
static void nvme_queue_rqs(struct rq_list *rqlist)
{
- struct rq_list submit_list = { };
struct rq_list requeue_list = { };
- struct nvme_queue *nvmeq = NULL;
struct request *req;
- while ((req = rq_list_pop(rqlist))) {
- if (nvmeq && nvmeq != req->mq_hctx->driver_data)
- nvme_submit_cmds(nvmeq, &submit_list);
- nvmeq = req->mq_hctx->driver_data;
+ rq_list_for_each(rqlist, req) {
+ struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+ struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- if (nvme_prep_rq_batch(nvmeq, req))
- rq_list_add_tail(&submit_list, req);
- else
- rq_list_add_tail(&requeue_list, req);
+ /*
+ * We should not need to do this, but we're still using this to
+ * ensure we can drain requests on a dying queue.
+ */
+ if (unlikely(!test_bit(NVMEQ_ENABLED, &nvmeq->flags)) ||
+ unlikely(!nvme_check_ready(&nvmeq->dev->ctrl, req, true)) ||
+ unlikely(nvme_prep_rq(nvmeq->dev, req) != BLK_STS_OK))
+ iod->batch_failed = true;
}
- if (nvmeq)
- nvme_submit_cmds(nvmeq, &submit_list);
+ do {
+ nvme_submit_cmds(rqlist, &requeue_list);
+ } while (!rq_list_empty(rqlist));
+
*rqlist = requeue_list;
}
^ permalink raw reply related [flat|nested] 26+ messages in thread