public inbox for [email protected]
* [PATCH v2 0/6] dm: support IO polling for bio-based dm device
@ 2021-01-25 12:13 Jeffle Xu
  2021-01-25 12:13 ` [PATCH v2 1/6] block: move definition of blk_qc_t to types.h Jeffle Xu
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Jeffle Xu @ 2021-01-25 12:13 UTC (permalink / raw)
  To: snitzer; +Cc: joseph.qi, dm-devel, linux-block, io-uring

Since we currently have no simple but efficient way to implement
bio-based IO polling in the split-bio tracking style, this patch set
turns to the original implementation mechanism that iterates over and
polls all underlying hw queues in polling mode. One optimization is
introduced to mitigate the race on one hw queue among multiple polling
instances.

I'm still open to the split-bio tracking mechanism, if there's a
reasonable way to implement it.


[Performance Test]
The performance is tested with fio (engine=io_uring) doing 4k randread
on a dm-linear device. The dm-linear device is built upon nvme devices,
and every nvme device has one polling hw queue (nvme.poll_queues=1).

Test Case		    | IOPS in IRQ mode | IOPS in polling mode | Diff
			    | (hipri=0)	       | (hipri=1)	      |
--------------------------- | ---------------- | -------------------- | ----
3 target nvme, num_jobs = 1 | 198k 	       | 276k		      | ~40%
3 target nvme, num_jobs = 3 | 608k 	       | 705k		      | ~16%
6 target nvme, num_jobs = 6 | 1197k 	       | 1347k		      | ~13%
3 target nvme, num_jobs = 6 | 1285k 	       | 1293k		      | ~0%

As the number of polling instances (num_jobs) increases, the
performance improvement decreases, though it's still positive
compared to the IRQ mode.

[Optimization]
To mitigate the race when iterating over all the underlying hw queues,
one flag is maintained on a per-hw-queue basis. This flag is used to
indicate whether this polling hw queue is currently being polled on or
not. Every polling hw queue is exclusive to one polling instance, i.e.,
a polling instance will skip this polling hw queue if it is currently
being polled by another polling instance, and start polling on the next
hw queue.

This per-hw-queue flag map is currently maintained in the dm layer. In
the table load phase, a table describing all underlying polling hw
queues is built and stored in 'struct dm_table', so it is also safe
when the mapping table is reloaded.
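
The core of the resulting poll loop looks roughly like the following
(simplified from dm_poll_one_md() in patch 6, ignoring the stacked-dm
case):

	for (i = 0; i < t->num_ctx; i++) {
		struct target_ctx *ctx = &t->ctxs[i];

		/* skip hw queues currently polled by another instance */
		if (!atomic_cmpxchg(&ctx->busy, 0, 1)) {
			ret += blk_mq_poll(ctx->q, ctx->hctx);
			atomic_set(&ctx->busy, 0);
		}
	}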


changes since v1:
- patches 1, 2 and 4 are the same as in v1 and have already been
reviewed
- patch 3 is refactored a bit on the basis of suggestions from
Mike Snitzer.
- patch 5 is newly added and introduces one new queue flag
representing whether the queue is capable of IO polling. This mainly
simplifies the logic in queue_poll_store().
- patch 6 implements the core mechanism supporting IO polling.
The sanity check verifying that the dm device supports IO polling is
also folded into this patch, and the queue flag will be cleared if
polling is not supported, in case of table reloading.


Jeffle Xu (6):
  block: move definition of blk_qc_t to types.h
  block: add queue_to_disk() to get gendisk from request_queue
  block: add iopoll method to support bio-based IO polling
  dm: always return BLK_QC_T_NONE for bio-based device
  block: add QUEUE_FLAG_POLL_CAP flag
  dm: support IO polling for bio-based dm device

 block/blk-core.c             |  76 +++++++++++++++++++++
 block/blk-mq.c               |  76 +++------------------
 block/blk-sysfs.c            |   3 +-
 drivers/md/dm-core.h         |  21 ++++++
 drivers/md/dm-table.c        | 127 +++++++++++++++++++++++++++++++++++
 drivers/md/dm.c              |  61 ++++++++++++-----
 include/linux/blk-mq.h       |   3 +
 include/linux/blk_types.h    |   2 +-
 include/linux/blkdev.h       |   9 +++
 include/linux/fs.h           |   2 +-
 include/linux/types.h        |   3 +
 include/trace/events/kyber.h |   6 +-
 12 files changed, 302 insertions(+), 87 deletions(-)

-- 
2.27.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 1/6] block: move definition of blk_qc_t to types.h
  2021-01-25 12:13 [PATCH v2 0/6] dm: support IO polling for bio-based dm device Jeffle Xu
@ 2021-01-25 12:13 ` Jeffle Xu
  2021-01-25 12:13 ` [PATCH v2 2/6] block: add queue_to_disk() to get gendisk from request_queue Jeffle Xu
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Jeffle Xu @ 2021-01-25 12:13 UTC (permalink / raw)
  To: snitzer; +Cc: joseph.qi, dm-devel, linux-block, io-uring

So that kiocb.ki_cookie can be defined as blk_qc_t, which will enforce
the encapsulation.

Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
---
 include/linux/blk_types.h | 2 +-
 include/linux/fs.h        | 2 +-
 include/linux/types.h     | 3 +++
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 866f74261b3b..2e05244fc16d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -532,7 +532,7 @@ static inline int op_stat_group(unsigned int op)
 	return op_is_write(op);
 }
 
-typedef unsigned int blk_qc_t;
+/* Macros for blk_qc_t */
 #define BLK_QC_T_NONE		-1U
 #define BLK_QC_T_SHIFT		16
 #define BLK_QC_T_INTERNAL	(1U << 31)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd47deea7c17..04b687150736 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -330,7 +330,7 @@ struct kiocb {
 	u16			ki_hint;
 	u16			ki_ioprio; /* See linux/ioprio.h */
 	union {
-		unsigned int		ki_cookie; /* for ->iopoll */
+		blk_qc_t		ki_cookie; /* for ->iopoll */
 		struct wait_page_queue	*ki_waitq; /* for async buffered IO */
 	};
 
diff --git a/include/linux/types.h b/include/linux/types.h
index a147977602b5..da5ca7e1bea9 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -125,6 +125,9 @@ typedef s64			int64_t;
 typedef u64 sector_t;
 typedef u64 blkcnt_t;
 
+/* cookie used for IO polling */
+typedef unsigned int blk_qc_t;
+
 /*
  * The type of an index into the pagecache.
  */
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 2/6] block: add queue_to_disk() to get gendisk from request_queue
  2021-01-25 12:13 [PATCH v2 0/6] dm: support IO polling for bio-based dm device Jeffle Xu
  2021-01-25 12:13 ` [PATCH v2 1/6] block: move definition of blk_qc_t to types.h Jeffle Xu
@ 2021-01-25 12:13 ` Jeffle Xu
  2021-01-27 17:20   ` Mike Snitzer
  2021-01-25 12:13 ` [PATCH v2 3/6] block: add iopoll method to support bio-based IO polling Jeffle Xu
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Jeffle Xu @ 2021-01-25 12:13 UTC (permalink / raw)
  To: snitzer; +Cc: joseph.qi, dm-devel, linux-block, io-uring

Sometimes we need to get the corresponding gendisk from request_queue.

It is preferred that block drivers store private data in
gendisk->private_data rather than request_queue->queuedata, e.g. see:
commit c4a59c4e5db3 ("dm: stop using ->queuedata").

So if only request_queue is given, we need to get its corresponding
gendisk to get the private data stored in that gendisk.
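
For example, with queue_to_disk() a bio-based driver can recover its
private data when only a request_queue is at hand (a minimal sketch;
dm does exactly this in the last patch of this series):

	struct gendisk *disk = queue_to_disk(q);
	struct mapped_device *md = disk->private_data;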

Signed-off-by: Jeffle Xu <[email protected]>
Review-by: Mike Snitzer <[email protected]>
---
 include/linux/blkdev.h       | 2 ++
 include/trace/events/kyber.h | 6 +++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f94ee3089e01..2a802e011a95 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -687,6 +687,8 @@ static inline bool blk_account_rq(struct request *rq)
 	dma_map_page_attrs(dev, (bv)->bv_page, (bv)->bv_offset, (bv)->bv_len, \
 	(dir), (attrs))
 
+#define queue_to_disk(q)	(dev_to_disk(kobj_to_dev((q)->kobj.parent)))
+
 static inline bool queue_is_mq(struct request_queue *q)
 {
 	return q->mq_ops;
diff --git a/include/trace/events/kyber.h b/include/trace/events/kyber.h
index c0e7d24ca256..f9802562edf6 100644
--- a/include/trace/events/kyber.h
+++ b/include/trace/events/kyber.h
@@ -30,7 +30,7 @@ TRACE_EVENT(kyber_latency,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+		__entry->dev		= disk_devt(queue_to_disk(q));
 		strlcpy(__entry->domain, domain, sizeof(__entry->domain));
 		strlcpy(__entry->type, type, sizeof(__entry->type));
 		__entry->percentile	= percentile;
@@ -59,7 +59,7 @@ TRACE_EVENT(kyber_adjust,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+		__entry->dev		= disk_devt(queue_to_disk(q));
 		strlcpy(__entry->domain, domain, sizeof(__entry->domain));
 		__entry->depth		= depth;
 	),
@@ -81,7 +81,7 @@ TRACE_EVENT(kyber_throttled,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+		__entry->dev		= disk_devt(queue_to_disk(q));
 		strlcpy(__entry->domain, domain, sizeof(__entry->domain));
 	),
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 3/6] block: add iopoll method to support bio-based IO polling
  2021-01-25 12:13 [PATCH v2 0/6] dm: support IO polling for bio-based dm device Jeffle Xu
  2021-01-25 12:13 ` [PATCH v2 1/6] block: move definition of blk_qc_t to types.h Jeffle Xu
  2021-01-25 12:13 ` [PATCH v2 2/6] block: add queue_to_disk() to get gendisk from request_queue Jeffle Xu
@ 2021-01-25 12:13 ` Jeffle Xu
  2021-01-27 17:14   ` Mike Snitzer
  2021-01-28  8:40   ` Christoph Hellwig
  2021-01-25 12:13 ` [PATCH v2 4/6] dm: always return BLK_QC_T_NONE for bio-based device Jeffle Xu
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 17+ messages in thread
From: Jeffle Xu @ 2021-01-25 12:13 UTC (permalink / raw)
  To: snitzer; +Cc: joseph.qi, dm-devel, linux-block, io-uring

->poll_fn was introduced in commit ea435e1b9392 ("block: add a poll_fn
callback to struct request_queue") to support bio-based queues such as
nvme multipath, but was later removed in commit 529262d56dbe ("block:
remove ->poll_fn").

Given commit c62b37d96b6e ("block: move ->make_request_fn to struct
block_device_operations"), restore the possibility of bio-based IO
polling support by adding an ->iopoll method to gendisk->fops.
Elevate the bulk of the blk_mq_poll() implementation into blk_poll() and
reduce blk_mq_poll() to the blk-mq specific code that is called from
blk_poll().
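
A bio-based driver can then hook into polling roughly as follows; the
foo_* names are hypothetical and only illustrate the intended use (dm
wires this up in the last patch of this series):

	static int foo_iopoll(struct request_queue *q, blk_qc_t cookie)
	{
		struct foo_dev *foo = queue_to_disk(q)->private_data;

		/* reap completions on whatever foo submitted bios to */
		return foo_poll_underlying_queues(foo);
	}

	static const struct block_device_operations foo_fops = {
		.submit_bio	= foo_submit_bio,
		.iopoll		= foo_iopoll,
	};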

Signed-off-by: Jeffle Xu <[email protected]>
Suggested-by: Mike Snitzer <[email protected]>
---
 block/blk-core.c       | 70 +++++++++++++++++++++++++++++++++++++++
 block/blk-mq.c         | 74 ++++++------------------------------------
 include/linux/blk-mq.h |  3 ++
 include/linux/blkdev.h |  1 +
 4 files changed, 84 insertions(+), 64 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7663a9b94b80..3d93aaa9a49b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1132,6 +1132,76 @@ blk_qc_t submit_bio(struct bio *bio)
 }
 EXPORT_SYMBOL(submit_bio);
 
+/**
+ * blk_poll - poll for IO completions
+ * @q:  the queue
+ * @cookie: cookie passed back at IO submission time
+ * @spin: whether to spin for completions
+ *
+ * Description:
+ *    Poll for completions on the passed in queue. Returns number of
+ *    completed entries found. If @spin is true, then blk_poll will continue
+ *    looping until at least one completion is found, unless the task is
+ *    otherwise marked running (or we need to reschedule).
+ */
+int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
+{
+	long state;
+	struct blk_mq_hw_ctx *hctx = NULL;
+	struct gendisk *disk = NULL;
+
+	if (!blk_qc_t_valid(cookie) ||
+	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+		return 0;
+
+	if (current->plug)
+		blk_flush_plug_list(current->plug, false);
+
+	if (queue_is_mq(q)) {
+		/*
+		 * If we sleep, have the caller restart the poll loop to reset
+		 * the state. Like for the other success return cases, the
+		 * caller is responsible for checking if the IO completed. If
+		 * the IO isn't complete, we'll get called again and will go
+		 * straight to the busy poll loop. If specified not to spin,
+		 * we also should not sleep.
+		 */
+		hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
+		if (spin && blk_mq_poll_hybrid(q, hctx, cookie))
+			return 1;
+		hctx->poll_considered++;
+	} else
+		disk = queue_to_disk(q);
+
+	state = current->state;
+	do {
+		int ret;
+
+		if (hctx)
+			ret = blk_mq_poll(q, hctx);
+		else if (disk->fops->iopoll)
+			ret = disk->fops->iopoll(q, cookie);
+
+		if (ret > 0) {
+			__set_current_state(TASK_RUNNING);
+			return ret;
+		}
+
+		if (signal_pending_state(state, current))
+			__set_current_state(TASK_RUNNING);
+
+		if (current->state == TASK_RUNNING)
+			return 1;
+		if (ret < 0 || !spin)
+			break;
+		cpu_relax();
+	} while (!need_resched());
+
+	__set_current_state(TASK_RUNNING);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(blk_poll);
+
 /**
  * blk_cloned_rq_check_limits - Helper function to check a cloned request
  *                              for the new queue limits
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f285a9123a8b..d058b9cbdf76 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3801,8 +3801,8 @@ static bool blk_mq_poll_hybrid_sleep(struct request_queue *q,
 	return true;
 }
 
-static bool blk_mq_poll_hybrid(struct request_queue *q,
-			       struct blk_mq_hw_ctx *hctx, blk_qc_t cookie)
+bool blk_mq_poll_hybrid(struct request_queue *q,
+			struct blk_mq_hw_ctx *hctx, blk_qc_t cookie)
 {
 	struct request *rq;
 
@@ -3826,72 +3826,18 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
 	return blk_mq_poll_hybrid_sleep(q, rq);
 }
 
-/**
- * blk_poll - poll for IO completions
- * @q:  the queue
- * @cookie: cookie passed back at IO submission time
- * @spin: whether to spin for completions
- *
- * Description:
- *    Poll for completions on the passed in queue. Returns number of
- *    completed entries found. If @spin is true, then blk_poll will continue
- *    looping until at least one completion is found, unless the task is
- *    otherwise marked running (or we need to reschedule).
- */
-int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
+int blk_mq_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx)
 {
-	struct blk_mq_hw_ctx *hctx;
-	long state;
-
-	if (!blk_qc_t_valid(cookie) ||
-	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
-		return 0;
-
-	if (current->plug)
-		blk_flush_plug_list(current->plug, false);
-
-	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
-
-	/*
-	 * If we sleep, have the caller restart the poll loop to reset
-	 * the state. Like for the other success return cases, the
-	 * caller is responsible for checking if the IO completed. If
-	 * the IO isn't complete, we'll get called again and will go
-	 * straight to the busy poll loop. If specified not to spin,
-	 * we also should not sleep.
-	 */
-	if (spin && blk_mq_poll_hybrid(q, hctx, cookie))
-		return 1;
-
-	hctx->poll_considered++;
-
-	state = current->state;
-	do {
-		int ret;
-
-		hctx->poll_invoked++;
-
-		ret = q->mq_ops->poll(hctx);
-		if (ret > 0) {
-			hctx->poll_success++;
-			__set_current_state(TASK_RUNNING);
-			return ret;
-		}
-
-		if (signal_pending_state(state, current))
-			__set_current_state(TASK_RUNNING);
+	int ret;
 
-		if (current->state == TASK_RUNNING)
-			return 1;
-		if (ret < 0 || !spin)
-			break;
-		cpu_relax();
-	} while (!need_resched());
+	hctx->poll_invoked++;
+	ret = q->mq_ops->poll(hctx);
+	if (ret > 0)
+		hctx->poll_success++;
 
-	__set_current_state(TASK_RUNNING);
-	return 0;
+	return ret;
 }
-EXPORT_SYMBOL_GPL(blk_poll);
+EXPORT_SYMBOL(blk_mq_poll);
 
 unsigned int blk_mq_rq_cpu(struct request *rq)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d705b174d346..8062a2d046a9 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -607,6 +607,9 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
 }
 
 blk_qc_t blk_mq_submit_bio(struct bio *bio);
+int blk_mq_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx);
+bool blk_mq_poll_hybrid(struct request_queue *q,
+		struct blk_mq_hw_ctx *hctx, blk_qc_t cookie);
 void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx,
 		struct lock_class_key *key);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2a802e011a95..bc540df197cb 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1853,6 +1853,7 @@ static inline void blk_ksm_unregister(struct request_queue *q) { }
 
 struct block_device_operations {
 	blk_qc_t (*submit_bio) (struct bio *bio);
+	int (*iopoll)(struct request_queue *q, blk_qc_t cookie);
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
 	int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 4/6] dm: always return BLK_QC_T_NONE for bio-based device
  2021-01-25 12:13 [PATCH v2 0/6] dm: support IO polling for bio-based dm device Jeffle Xu
                   ` (2 preceding siblings ...)
  2021-01-25 12:13 ` [PATCH v2 3/6] block: add iopoll method to support bio-based IO polling Jeffle Xu
@ 2021-01-25 12:13 ` Jeffle Xu
  2021-01-25 12:13 ` [PATCH v2 5/6] block: add QUEUE_FLAG_POLL_CAP flag Jeffle Xu
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Jeffle Xu @ 2021-01-25 12:13 UTC (permalink / raw)
  To: snitzer; +Cc: joseph.qi, dm-devel, linux-block, io-uring

Currently the returned cookie of a bio-based device is not used at all.

In the following patch, a bio-based device will actually return a
pointer to a specific object as the returned cookie.

Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
---
 drivers/md/dm.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 7bac564f3faa..46ca3b739396 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1252,14 +1252,13 @@ void dm_accept_partial_bio(struct bio *bio, unsigned n_sectors)
 }
 EXPORT_SYMBOL_GPL(dm_accept_partial_bio);
 
-static blk_qc_t __map_bio(struct dm_target_io *tio)
+static void __map_bio(struct dm_target_io *tio)
 {
 	int r;
 	sector_t sector;
 	struct bio *clone = &tio->clone;
 	struct dm_io *io = tio->io;
 	struct dm_target *ti = tio->ti;
-	blk_qc_t ret = BLK_QC_T_NONE;
 
 	clone->bi_end_io = clone_endio;
 
@@ -1278,7 +1277,7 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
 	case DM_MAPIO_REMAPPED:
 		/* the bio has been remapped so dispatch it */
 		trace_block_bio_remap(clone, bio_dev(io->orig_bio), sector);
-		ret = submit_bio_noacct(clone);
+		submit_bio_noacct(clone);
 		break;
 	case DM_MAPIO_KILL:
 		free_tio(tio);
@@ -1292,8 +1291,6 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
 		DMWARN("unimplemented target map return value: %d", r);
 		BUG();
 	}
-
-	return ret;
 }
 
 static void bio_setup_sector(struct bio *bio, sector_t sector, unsigned len)
@@ -1380,7 +1377,7 @@ static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci,
 	}
 }
 
-static blk_qc_t __clone_and_map_simple_bio(struct clone_info *ci,
+static void __clone_and_map_simple_bio(struct clone_info *ci,
 					   struct dm_target_io *tio, unsigned *len)
 {
 	struct bio *clone = &tio->clone;
@@ -1391,7 +1388,7 @@ static blk_qc_t __clone_and_map_simple_bio(struct clone_info *ci,
 	if (len)
 		bio_setup_sector(clone, ci->sector, *len);
 
-	return __map_bio(tio);
+	__map_bio(tio);
 }
 
 static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
@@ -1405,7 +1402,7 @@ static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
 
 	while ((bio = bio_list_pop(&blist))) {
 		tio = container_of(bio, struct dm_target_io, clone);
-		(void) __clone_and_map_simple_bio(ci, tio, len);
+		__clone_and_map_simple_bio(ci, tio, len);
 	}
 }
 
@@ -1450,7 +1447,7 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
 		free_tio(tio);
 		return r;
 	}
-	(void) __map_bio(tio);
+	__map_bio(tio);
 
 	return 0;
 }
@@ -1565,11 +1562,10 @@ static void init_clone_info(struct clone_info *ci, struct mapped_device *md,
 /*
  * Entry point to split a bio into clones and submit them to the targets.
  */
-static blk_qc_t __split_and_process_bio(struct mapped_device *md,
+static void __split_and_process_bio(struct mapped_device *md,
 					struct dm_table *map, struct bio *bio)
 {
 	struct clone_info ci;
-	blk_qc_t ret = BLK_QC_T_NONE;
 	int error = 0;
 
 	init_clone_info(&ci, md, map, bio);
@@ -1613,7 +1609,7 @@ static blk_qc_t __split_and_process_bio(struct mapped_device *md,
 
 				bio_chain(b, bio);
 				trace_block_split(b, bio->bi_iter.bi_sector);
-				ret = submit_bio_noacct(bio);
+				submit_bio_noacct(bio);
 				break;
 			}
 		}
@@ -1621,13 +1617,11 @@ static blk_qc_t __split_and_process_bio(struct mapped_device *md,
 
 	/* drop the extra reference count */
 	dec_pending(ci.io, errno_to_blk_status(error));
-	return ret;
 }
 
 static blk_qc_t dm_submit_bio(struct bio *bio)
 {
 	struct mapped_device *md = bio->bi_disk->private_data;
-	blk_qc_t ret = BLK_QC_T_NONE;
 	int srcu_idx;
 	struct dm_table *map;
 
@@ -1657,10 +1651,10 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
 	if (is_abnormal_io(bio))
 		blk_queue_split(&bio);
 
-	ret = __split_and_process_bio(md, map, bio);
+	__split_and_process_bio(md, map, bio);
 out:
 	dm_put_live_table(md, srcu_idx);
-	return ret;
+	return BLK_QC_T_NONE;
 }
 
 /*-----------------------------------------------------------------
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 5/6] block: add QUEUE_FLAG_POLL_CAP flag
  2021-01-25 12:13 [PATCH v2 0/6] dm: support IO polling for bio-based dm device Jeffle Xu
                   ` (3 preceding siblings ...)
  2021-01-25 12:13 ` [PATCH v2 4/6] dm: always return BLK_QC_T_NONE for bio-based device Jeffle Xu
@ 2021-01-25 12:13 ` Jeffle Xu
  2021-01-27 17:13   ` Mike Snitzer
  2021-01-25 12:13 ` [PATCH v2 6/6] dm: support IO polling for bio-based dm device Jeffle Xu
  2021-01-27 17:19 ` [PATCH v2 0/6] " Mike Snitzer
  6 siblings, 1 reply; 17+ messages in thread
From: Jeffle Xu @ 2021-01-25 12:13 UTC (permalink / raw)
  To: snitzer; +Cc: joseph.qi, dm-devel, linux-block, io-uring

Introduce the QUEUE_FLAG_POLL_CAP flag, representing whether the
request queue is capable of polling or not.

Signed-off-by: Jeffle Xu <[email protected]>
---
 block/blk-mq.c         | 2 +-
 block/blk-sysfs.c      | 3 +--
 include/linux/blkdev.h | 6 ++++++
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d058b9cbdf76..10f06337d8dc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3203,7 +3203,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
 	if (set->nr_maps > HCTX_TYPE_POLL &&
 	    set->map[HCTX_TYPE_POLL].nr_queues)
-		blk_queue_flag_set(QUEUE_FLAG_POLL, q);
+		q->queue_flags |= QUEUE_FLAG_POLL_MASK;
 
 	q->sg_reserved_size = INT_MAX;
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index b513f1683af0..65693efb7c76 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -420,8 +420,7 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
 	unsigned long poll_on;
 	ssize_t ret;
 
-	if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL ||
-	    !q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
+	if (!blk_queue_poll_cap(q))
 		return -EINVAL;
 
 	ret = queue_var_store(&poll_on, page, count);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bc540df197cb..095b486de02f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -621,11 +621,16 @@ struct request_queue {
 #define QUEUE_FLAG_RQ_ALLOC_TIME 27	/* record rq->alloc_time_ns */
 #define QUEUE_FLAG_HCTX_ACTIVE	28	/* at least one blk-mq hctx is active */
 #define QUEUE_FLAG_NOWAIT       29	/* device supports NOWAIT */
+#define QUEUE_FLAG_POLL_CAP	30	/* capable of IO polling */
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_SAME_COMP) |		\
 				 (1 << QUEUE_FLAG_NOWAIT))
 
+#define QUEUE_FLAG_POLL_MASK	((1 << QUEUE_FLAG_POLL) |		\
+				 (1<< QUEUE_FLAG_POLL_CAP))
+
+
 void blk_queue_flag_set(unsigned int flag, struct request_queue *q);
 void blk_queue_flag_clear(unsigned int flag, struct request_queue *q);
 bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
@@ -667,6 +672,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #define blk_queue_fua(q)	test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
 #define blk_queue_registered(q)	test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
 #define blk_queue_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
+#define blk_queue_poll_cap(q)	test_bit(QUEUE_FLAG_POLL_CAP, &(q)->queue_flags)
 
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 6/6] dm: support IO polling for bio-based dm device
  2021-01-25 12:13 [PATCH v2 0/6] dm: support IO polling for bio-based dm device Jeffle Xu
                   ` (4 preceding siblings ...)
  2021-01-25 12:13 ` [PATCH v2 5/6] block: add QUEUE_FLAG_POLL_CAP flag Jeffle Xu
@ 2021-01-25 12:13 ` Jeffle Xu
  2021-01-29  7:37   ` JeffleXu
  2021-01-27 17:19 ` [PATCH v2 0/6] " Mike Snitzer
  6 siblings, 1 reply; 17+ messages in thread
From: Jeffle Xu @ 2021-01-25 12:13 UTC (permalink / raw)
  To: snitzer; +Cc: joseph.qi, dm-devel, linux-block, io-uring

DM will iterate over and poll all polling hardware queues of all target
mq devices when polling IO for a dm device. To mitigate the race
introduced by iterating over all target hw queues, a per-hw-queue flag
is maintained to indicate whether this polling hw queue is currently
being polled on or not. Every polling hw queue is exclusive to one
polling instance, i.e., a polling instance will skip this polling hw
queue if it is currently being polled by another polling instance, and
start polling on the next hw queue.

IO polling is enabled when all underlying target devices are capable
of IO polling. The sanity check supports the stacked device model, in
which one dm device may be built upon another dm device. In this case,
the mapped device will check if the underlying dm target device
supports IO polling.

Signed-off-by: Jeffle Xu <[email protected]>
---
 block/blk-core.c      |   8 ++-
 drivers/md/dm-core.h  |  21 +++++++
 drivers/md/dm-table.c | 127 ++++++++++++++++++++++++++++++++++++++++++
 drivers/md/dm.c       |  37 ++++++++++++
 4 files changed, 192 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3d93aaa9a49b..d81a5f0faa1d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1150,7 +1150,13 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	struct blk_mq_hw_ctx *hctx = NULL;
 	struct gendisk *disk = NULL;
 
-	if (!blk_qc_t_valid(cookie) ||
+	/*
+	 * In case of bio-base polling, the returned cookie is actually that of
+	 * the last split bio. Thus the returned cookie may be BLK_QC_T_NONE,
+	 * while the previous split bios have already been submitted and queued
+	 * into the polling hw queue.
+	 */
+	if ((queue_is_mq(q) && !blk_qc_t_valid(cookie)) ||
 	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
 		return 0;
 
diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 086d293c2b03..5a0391aa5cc7 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -127,6 +127,24 @@ static inline struct dm_stats *dm_get_stats(struct mapped_device *md)
 
 #define DM_TABLE_MAX_DEPTH 16
 
+enum target_ctx_type {
+	TARGET_TYPE_MQ,
+	TARGET_TYPE_DM,
+};
+
+struct target_ctx {
+	union {
+		struct {
+			struct blk_mq_hw_ctx *hctx;
+			struct request_queue *q;
+			atomic_t busy;
+		};
+		struct mapped_device *md;
+	};
+
+	int type;
+};
+
 struct dm_table {
 	struct mapped_device *md;
 	enum dm_queue_mode type;
@@ -162,6 +180,9 @@ struct dm_table {
 	void *event_context;
 
 	struct dm_md_mempools *mempools;
+
+	int num_ctx;
+	struct target_ctx *ctxs;
 };
 
 static inline struct completion *dm_get_completion_from_kobject(struct kobject *kobj)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 188f41287f18..397bb5f57626 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -215,6 +215,8 @@ void dm_table_destroy(struct dm_table *t)
 
 	dm_free_md_mempools(t->mempools);
 
+	kfree(t->ctxs);
+
 	kfree(t);
 }
 
@@ -1194,6 +1196,114 @@ static int dm_table_register_integrity(struct dm_table *t)
 	return 0;
 }
 
+static int device_supports_poll(struct dm_target *ti, struct dm_dev *dev,
+				sector_t start, sector_t len, void *data)
+{
+	struct request_queue *q = bdev_get_queue(dev->bdev);
+
+	return q && test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
+}
+
+static bool dm_table_supports_poll(struct dm_table *t)
+{
+	struct dm_target *ti;
+	unsigned int i;
+
+	/* Ensure that all targets support iopoll. */
+	for (i = 0; i < dm_table_get_num_targets(t); i++) {
+		ti = dm_table_get_target(t, i);
+
+		if (!ti->type->iterate_devices ||
+		    !ti->type->iterate_devices(ti, device_supports_poll, NULL))
+			return false;
+	}
+
+	return true;
+}
+
+static int dm_table_calc_target_ctxs(struct dm_target *ti,
+		struct dm_dev *dev,
+		sector_t start, sector_t len,
+		void *data)
+{
+	int *num = data;
+	struct request_queue *q = dev->bdev->bd_disk->queue;
+
+	if (queue_is_mq(q))
+		*num += q->tag_set->map[HCTX_TYPE_POLL].nr_queues;
+	else
+		*num += 1;
+
+	return 0;
+}
+
+static int dm_table_fill_target_ctxs(struct dm_target *ti,
+		struct dm_dev *dev,
+		sector_t start, sector_t len,
+		void *data)
+{
+	int *index = data;
+	struct target_ctx *ctx;
+	struct request_queue *q = dev->bdev->bd_disk->queue;
+
+	if (queue_is_mq(q)) {
+		int i;
+		int num = q->tag_set->map[HCTX_TYPE_POLL].nr_queues;
+		int offset = q->tag_set->map[HCTX_TYPE_POLL].queue_offset;
+
+		for (i = 0; i < num; i++) {
+			ctx = &ti->table->ctxs[(*index)++];
+			ctx->q = q;
+			ctx->hctx = q->queue_hw_ctx[offset + i];
+			ctx->type = TARGET_TYPE_MQ;
+			/* ctx->busy has been initialized to zero */
+		}
+	} else {
+		struct mapped_device *md = dev->bdev->bd_disk->private_data;
+
+		ctx = &ti->table->ctxs[(*index)++];
+		ctx->md = md;
+		ctx->type = TARGET_TYPE_DM;
+	}
+
+	return 0;
+}
+
+static int dm_table_build_target_ctxs(struct dm_table *t)
+{
+	int i, num = 0, index = 0;
+
+	if (!__table_type_bio_based(t->type) || !dm_table_supports_poll(t))
+		return 0;
+
+	for (i = 0; i < t->num_targets; i++) {
+		struct dm_target *ti = dm_table_get_target(t, i);
+
+		if (ti->type->iterate_devices)
+			ti->type->iterate_devices(ti, dm_table_calc_target_ctxs,
+					&num);
+	}
+
+	if (WARN_ON(!num))
+		return 0;
+
+	t->num_ctx = num;
+
+	t->ctxs = kcalloc(num, sizeof(struct target_ctx), GFP_KERNEL);
+	if (!t->ctxs)
+		return -ENOMEM;
+
+	for (i = 0; i < t->num_targets; i++) {
+		struct dm_target *ti = dm_table_get_target(t, i);
+
+		if (ti->type->iterate_devices)
+			ti->type->iterate_devices(ti, dm_table_fill_target_ctxs,
+					&index);
+	}
+
+	return 0;
+}
+
 /*
  * Prepares the table for use by building the indices,
  * setting the type, and allocating mempools.
@@ -1224,6 +1334,10 @@ int dm_table_complete(struct dm_table *t)
 	if (r)
 		DMERR("unable to allocate mempools");
 
+	r = dm_table_build_target_ctxs(t);
+	if (r)
+		DMERR("unable to build target hctxs");
+
 	return r;
 }
 
@@ -1883,6 +1997,19 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 #endif
 
 	blk_queue_update_readahead(q);
+
+	/*
+	 * Check for request-based device is remained to
+	 * dm_mq_init_request_queue()->blk_mq_init_allocated_queue().
+	 * Also clear previously set QUEUE_FLAG_POLL* if the new table doesn't
+	 * support iopoll while reloading table.
+	 */
+	if (__table_type_bio_based(t->type)) {
+		if (t->ctxs)
+			q->queue_flags |= QUEUE_FLAG_POLL_MASK;
+		else
+			q->queue_flags &= ~QUEUE_FLAG_POLL_MASK;
+	}
 }
 
 unsigned int dm_table_get_num_targets(struct dm_table *t)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 46ca3b739396..e2be1caa086a 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1657,6 +1657,42 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
 	return BLK_QC_T_NONE;
 }
 
+
+static int dm_poll_one_md(struct mapped_device *md)
+{
+	int i, num_ctx, srcu_idx, ret = 0;
+	struct dm_table *t;
+	struct target_ctx *ctxs;
+
+	t = dm_get_live_table(md, &srcu_idx);
+	num_ctx = t->num_ctx;
+	ctxs = t->ctxs;
+
+	for (i = 0; i < num_ctx; i++) {
+		struct target_ctx *ctx = &ctxs[i];
+
+		if (ctx->type == TARGET_TYPE_MQ) {
+			if (!atomic_cmpxchg(&ctx->busy, 0, 1)) {
+				ret += blk_mq_poll(ctx->q, ctx->hctx);
+				atomic_set(&ctx->busy, 0);
+			}
+		} else
+			ret += dm_poll_one_md(ctx->md);
+	}
+
+	dm_put_live_table(md, srcu_idx);
+
+	return ret;
+}
+
+static int dm_bio_poll(struct request_queue *q, blk_qc_t cookie)
+{
+	struct gendisk *disk = queue_to_disk(q);
+	struct mapped_device *md= disk->private_data;
+
+	return dm_poll_one_md(md);
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -3049,6 +3085,7 @@ static const struct pr_ops dm_pr_ops = {
 };
 
 static const struct block_device_operations dm_blk_dops = {
+	.iopoll = dm_bio_poll,
 	.submit_bio = dm_submit_bio,
 	.open = dm_blk_open,
 	.release = dm_blk_close,
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 5/6] block: add QUEUE_FLAG_POLL_CAP flag
  2021-01-25 12:13 ` [PATCH v2 5/6] block: add QUEUE_FLAG_POLL_CAP flag Jeffle Xu
@ 2021-01-27 17:13   ` Mike Snitzer
  2021-01-28  2:07     ` JeffleXu
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Snitzer @ 2021-01-27 17:13 UTC (permalink / raw)
  To: Jeffle Xu; +Cc: joseph.qi, dm-devel, linux-block, io-uring

On Mon, Jan 25 2021 at  7:13am -0500,
Jeffle Xu <[email protected]> wrote:

> Introduce QUEUE_FLAG_POLL_CAP flag representing if the request queue
> capable of polling or not.
> 
> Signed-off-by: Jeffle Xu <[email protected]>

Why are you adding QUEUE_FLAG_POLL_CAP?  Doesn't seem as though DM or
anything else actually needs it.

Mike


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/6] block: add iopoll method to support bio-based IO polling
  2021-01-25 12:13 ` [PATCH v2 3/6] block: add iopoll method to support bio-based IO polling Jeffle Xu
@ 2021-01-27 17:14   ` Mike Snitzer
  2021-01-28  8:40   ` Christoph Hellwig
  1 sibling, 0 replies; 17+ messages in thread
From: Mike Snitzer @ 2021-01-27 17:14 UTC (permalink / raw)
  To: Jeffle Xu; +Cc: joseph.qi, dm-devel, linux-block, io-uring

On Mon, Jan 25 2021 at  7:13am -0500,
Jeffle Xu <[email protected]> wrote:

> ->poll_fn was introduced in commit ea435e1b9392 ("block: add a poll_fn
> callback to struct request_queue") to support bio-based queues such as
> nvme multipath, but was later removed in commit 529262d56dbe ("block:
> remove ->poll_fn").
> 
> Given commit c62b37d96b6e ("block: move ->make_request_fn to struct
> block_device_operations") restore the possibility of bio-based IO
> polling support by adding an ->iopoll method to gendisk->fops.
> Elevate bulk of blk_mq_poll() implementation to blk_poll() and reduce
> blk_mq_poll() to blk-mq specific code that is called from blk_poll().
> 
> Signed-off-by: Jeffle Xu <[email protected]>
> Suggested-by: Mike Snitzer <[email protected]>

Reviewed-by: Mike Snitzer <[email protected]>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/6] dm: support IO polling for bio-based dm device
  2021-01-25 12:13 [PATCH v2 0/6] dm: support IO polling for bio-based dm device Jeffle Xu
                   ` (5 preceding siblings ...)
  2021-01-25 12:13 ` [PATCH v2 6/6] dm: support IO polling for bio-based dm device Jeffle Xu
@ 2021-01-27 17:19 ` Mike Snitzer
  2021-01-28  3:06   ` JeffleXu
  6 siblings, 1 reply; 17+ messages in thread
From: Mike Snitzer @ 2021-01-27 17:19 UTC (permalink / raw)
  To: Jeffle Xu; +Cc: joseph.qi, dm-devel, linux-block, io-uring

On Mon, Jan 25 2021 at  7:13am -0500,
Jeffle Xu <[email protected]> wrote:

> Since currently we have no simple but efficient way to implement the
> bio-based IO polling in the split-bio tracking style, this patch set
> turns to the original implementation mechanism that iterates and
> polls all underlying hw queues in polling mode. One optimization is
> introduced to mitigate the race of one hw queue among multiple polling
> instances.
> 
> I'm still open to the split bio tracking mechanism, if there's
> reasonable way to implement it.
> 
> 
> [Performance Test]
> The performance is tested by fio (engine=io_uring) 4k randread on
> dm-linear device. The dm-linear device is built upon nvme devices,
> and every nvme device has one polling hw queue (nvme.poll_queues=1).
> 
> Test Case		    | IOPS in IRQ mode | IOPS in polling mode | Diff
> 			    | (hipri=0)	       | (hipri=1)	      |
> --------------------------- | ---------------- | -------------------- | ----
> 3 target nvme, num_jobs = 1 | 198k 	       | 276k		      | ~40%
> 3 target nvme, num_jobs = 3 | 608k 	       | 705k		      | ~16%
> 6 target nvme, num_jobs = 6 | 1197k 	       | 1347k		      | ~13%
> 3 target nvme, num_jobs = 6 | 1285k 	       | 1293k		      | ~0%
> 
> As the number of polling instances (num_jobs) increases, the
> performance improvement decreases, though it's still positive
> compared to the IRQ mode.

I think there is serious room for improvement for DM's implementation;
but the block changes for this are all we'd need for DM in the long run
anyway (famous last words). So on a block interface level I'm OK with
block patches 1-3.

I don't see why patch 5 is needed (said the same in reply to it; but I
just saw your reason below..).

Anyway, I can pick up DM patches 4 and 6 via linux-dm.git if Jens picks
up patches 1-3. Jens, what do you think?

> [Optimization]
> To mitigate the race when iterating all the underlying hw queues, one
> flag is maintained on a per-hw-queue basis. This flag is used to
> indicate whether this polling hw queue currently being polled on or
> not. Every polling hw queue is exclusive to one polling instance, i.e.,
> the polling instance will skip this polling hw queue if this hw queue
> currently is being polled by another polling instance, and start
> polling on the next hw queue.
> 
> This per-hw-queue flag map is currently maintained in dm layer. In
> the table load phase, a table describing all underlying polling hw
> queues is built and stored in 'struct dm_table'. It is safe when
> reloading the mapping table.
> 
> 
> changes since v1:
> - patch 1,2,4 is the same as v1 and have already been reviewed
> - patch 3 is refactored a bit on the basis of suggestions from
> Mike Snitzer.
> - patch 5 is newly added and introduces one new queue flag
> representing if the queue is capable of IO polling. This mainly
> simplifies the logic in queue_poll_store().

Ah OK, don't see why we want to eat a queue flag for that though!

> - patch 6 implements the core mechanism supporting IO polling.
> The sanity check checking if the dm device supports IO polling is
> also folded into this patch, and the queue flag will be cleared if
> it doesn't support, in case of table reloading.

Thanks,
Mike


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 2/6] block: add queue_to_disk() to get gendisk from request_queue
  2021-01-25 12:13 ` [PATCH v2 2/6] block: add queue_to_disk() to get gendisk from request_queue Jeffle Xu
@ 2021-01-27 17:20   ` Mike Snitzer
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Snitzer @ 2021-01-27 17:20 UTC (permalink / raw)
  To: Jeffle Xu; +Cc: joseph.qi, dm-devel, linux-block, io-uring

On Mon, Jan 25 2021 at  7:13am -0500,
Jeffle Xu <[email protected]> wrote:

> Sometimes we need to get the corresponding gendisk from request_queue.
> 
> It is preferred that block drivers store private data in
> gendisk->private_data rather than request_queue->queuedata, e.g. see:
> commit c4a59c4e5db3 ("dm: stop using ->queuedata").
> 
> So if only request_queue is given, we need to get its corresponding
> gendisk to get the private data stored in that gendisk.
> 
> Signed-off-by: Jeffle Xu <[email protected]>
> Review-by: Mike Snitzer <[email protected]>

^typo

Reviewed-by: Mike Snitzer <[email protected]>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 5/6] block: add QUEUE_FLAG_POLL_CAP flag
  2021-01-27 17:13   ` Mike Snitzer
@ 2021-01-28  2:07     ` JeffleXu
  0 siblings, 0 replies; 17+ messages in thread
From: JeffleXu @ 2021-01-28  2:07 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: joseph.qi, dm-devel, linux-block, io-uring



On 1/28/21 1:13 AM, Mike Snitzer wrote:
> On Mon, Jan 25 2021 at  7:13am -0500,
> Jeffle Xu <[email protected]> wrote:
> 
>> Introduce QUEUE_FLAG_POLL_CAP flag representing if the request queue
>> capable of polling or not.
>>
>> Signed-off-by: Jeffle Xu <[email protected]>
> 
> Why are you adding QUEUE_FLAG_POLL_CAP?  Doesn't seem as though DM or
> anything else actually needs it.

Users can switch polling on/off on a device via
'/sys/block/<dev>/queue/io_poll' at runtime. The prerequisite for
turning on polling is that the device is **capable** of polling. For mq
devices, the prerequisite is that the device has polling hw queues,
i.e.,

```
q->tag_set->nr_maps > HCTX_TYPE_POLL &&
q->tag_set->map[HCTX_TYPE_POLL].nr_queues
```

But for dm devices, we need to check if all the underlying devices
support polling or not. Without this newly added queue flag, we would
need to re-check every time users want to turn on polling via 'io_poll',
and thus the dm layer would need to export one interface to the block
layer, checking if all the underlying target devices support polling or
not, maybe just like the iopoll() method we did in patch 3. Something
like,

```
 struct block_device_operations {
+	bool (*support_iopoll)(struct request_queue *q);
```

The newly added queue flag 'QUEUE_FLAG_POLL_CAP' is just used as a
cache representing whether the device is **capable** of polling, while
the original queue flag 'QUEUE_FLAG_POLL' represents whether polling is
**currently** turned on for this device.


But indeed queue flags are a scarce resource, and adding a new queue
flag may not be the best solution.

Any ideas?


-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/6] dm: support IO polling for bio-based dm device
  2021-01-27 17:19 ` [PATCH v2 0/6] " Mike Snitzer
@ 2021-01-28  3:06   ` JeffleXu
  0 siblings, 0 replies; 17+ messages in thread
From: JeffleXu @ 2021-01-28  3:06 UTC (permalink / raw)
  To: Mike Snitzer, Jens Axboe; +Cc: joseph.qi, dm-devel, linux-block, io-uring



On 1/28/21 1:19 AM, Mike Snitzer wrote:
> On Mon, Jan 25 2021 at  7:13am -0500,
> Jeffle Xu <[email protected]> wrote:
> 
>> Since currently we have no simple but efficient way to implement the
>> bio-based IO polling in the split-bio tracking style, this patch set
>> turns to the original implementation mechanism that iterates and
>> polls all underlying hw queues in polling mode. One optimization is
>> introduced to mitigate the race of one hw queue among multiple polling
>> instances.
>>
>> I'm still open to the split bio tracking mechanism, if there's
>> reasonable way to implement it.
>>
>>
>> [Performance Test]
>> The performance is tested by fio (engine=io_uring) 4k randread on
>> dm-linear device. The dm-linear device is built upon nvme devices,
>> and every nvme device has one polling hw queue (nvme.poll_queues=1).
>>
>> Test Case		    | IOPS in IRQ mode | IOPS in polling mode | Diff
>> 			    | (hipri=0)	       | (hipri=1)	      |
>> --------------------------- | ---------------- | -------------------- | ----
>> 3 target nvme, num_jobs = 1 | 198k 	       | 276k		      | ~40%
>> 3 target nvme, num_jobs = 3 | 608k 	       | 705k		      | ~16%
>> 6 target nvme, num_jobs = 6 | 1197k 	       | 1347k		      | ~13%
>> 3 target nvme, num_jobs = 6 | 1285k 	       | 1293k		      | ~0%
>>
>> As the number of polling instances (num_jobs) increases, the
>> performance improvement decreases, though it's still positive
>> compared to the IRQ mode.
> 
> I think there is serious room for improvement for DM's implementation;
> but the block changes for this are all we'd need for DM in the longrun
> anyway (famous last words).

Agreed.


> So on a block interface level I'm OK with
> block patches 1-3.
> 
> I don't see why patch 5 is needed (said the same in reply to it; but I
> just saw your reason below..).
> 
> Anyway, I can pick up DM patches 4 and 6 via linux-dm.git if Jens picks
> up patches 1-3. Jens, what do you think?

cc Jens.

Also I will send a new version later, maybe with some refactoring of
patch 5 and some typo fixes.

> 
>> [Optimization]
>> To mitigate the race when iterating all the underlying hw queues, one
>> flag is maintained on a per-hw-queue basis. This flag is used to
>> indicate whether this polling hw queue currently being polled on or
>> not. Every polling hw queue is exclusive to one polling instance, i.e.,
>> the polling instance will skip this polling hw queue if this hw queue
>> currently is being polled by another polling instance, and start
>> polling on the next hw queue.
>>
>> This per-hw-queue flag map is currently maintained in dm layer. In
>> the table load phase, a table describing all underlying polling hw
>> queues is built and stored in 'struct dm_table'. It is safe when
>> reloading the mapping table.
>>
>>
>> changes since v1:
>> - patch 1,2,4 is the same as v1 and have already been reviewed
>> - patch 3 is refactored a bit on the basis of suggestions from
>> Mike Snitzer.
>> - patch 5 is newly added and introduces one new queue flag
>> representing if the queue is capable of IO polling. This mainly
>> simplifies the logic in queue_poll_store().
> 
> Ah OK, don't see why we want to eat a queue flag for that though!
> 
>> - patch 6 implements the core mechanism supporting IO polling.
>> The sanity check checking if the dm device supports IO polling is
>> also folded into this patch, and the queue flag will be cleared if
>> it doesn't support, in case of table reloading.
> 
> Thanks,
> Mike
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/6] block: add iopoll method to support bio-based IO polling
  2021-01-25 12:13 ` [PATCH v2 3/6] block: add iopoll method to support bio-based IO polling Jeffle Xu
  2021-01-27 17:14   ` Mike Snitzer
@ 2021-01-28  8:40   ` Christoph Hellwig
  2021-01-28 11:52     ` JeffleXu
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2021-01-28  8:40 UTC (permalink / raw)
  To: Jeffle Xu; +Cc: snitzer, joseph.qi, dm-devel, linux-block, io-uring

On Mon, Jan 25, 2021 at 08:13:37PM +0800, Jeffle Xu wrote:
> +int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)

Can you split the guts of this function into two separate helpers
for the mq vs non-mq case?  As is, it is a little hard to read and
introduces extra branches in the fast path.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/6] block: add iopoll method to support bio-based IO polling
  2021-01-28  8:40   ` Christoph Hellwig
@ 2021-01-28 11:52     ` JeffleXu
  2021-01-28 14:36       ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: JeffleXu @ 2021-01-28 11:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: snitzer, joseph.qi, dm-devel, linux-block, io-uring



On 1/28/21 4:40 PM, Christoph Hellwig wrote:
> On Mon, Jan 25, 2021 at 08:13:37PM +0800, Jeffle Xu wrote:
>> +int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> 
> Can you split the guts of this function into two separate helpers
> for the mq vs non-mq case?  As is is is a little hard to read and
> introduced extra branches in the fast path.
> 

I understand your concern; actually I had tried that before.

I can extract some helper functions, but I doubt whether the extra
function call is acceptable.

Besides, the iteration logic is generic, so I'm afraid the branch or
the function call is unavoidable. Or, if we maintain two separate
functions for mq and dm, the code duplication may be unavoidable.

Anyway, I'll give it a try.

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/6] block: add iopoll method to support bio-based IO polling
  2021-01-28 11:52     ` JeffleXu
@ 2021-01-28 14:36       ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2021-01-28 14:36 UTC (permalink / raw)
  To: JeffleXu
  Cc: Christoph Hellwig, snitzer, joseph.qi, dm-devel, linux-block,
	io-uring

On Thu, Jan 28, 2021 at 07:52:05PM +0800, JeffleXu wrote:
> 
> 
> On 1/28/21 4:40 PM, Christoph Hellwig wrote:
> > On Mon, Jan 25, 2021 at 08:13:37PM +0800, Jeffle Xu wrote:
> >> +int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> > 
> > Can you split the guts of this function into two separate helpers
> > for the mq vs non-mq case?  As is is is a little hard to read and
> > introduced extra branches in the fast path.
> > 
> 
> I know your consideration, actually I had ever tried.
> 
> I can extract some helper functions, but I'm doubted if the extra
> function call is acceptable.
> 
> Besides, the iteration logic is generic and I'm afraid the branch or
> function call is unavoidable. Or if we maintain two separate function
> for mq and dm, the code duplication may be unavoidable.

I'd just split the functions entirely at the highest level.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 6/6] dm: support IO polling for bio-based dm device
  2021-01-25 12:13 ` [PATCH v2 6/6] dm: support IO polling for bio-based dm device Jeffle Xu
@ 2021-01-29  7:37   ` JeffleXu
  0 siblings, 0 replies; 17+ messages in thread
From: JeffleXu @ 2021-01-29  7:37 UTC (permalink / raw)
  To: snitzer; +Cc: joseph.qi, dm-devel, linux-block, io-uring



On 1/25/21 8:13 PM, Jeffle Xu wrote:
> DM will iterate and poll all polling hardware queues of all target mq
> devices when polling IO for dm device. To mitigate the race introduced
> by iterating all target hw queues, a per-hw-queue flag is maintained
> to indicate whether this polling hw queue currently being polled on or
> not. Every polling hw queue is exclusive to one polling instance, i.e.,
> the polling instance will skip this polling hw queue if this hw queue
> currently is being polled by another polling instance, and start
> polling on the next hw queue.
> 
> IO polling is enabled when all underlying target devices are capable
> of IO polling. The sanity check supports the stacked device model, in
> which one dm device may be build upon another dm device. In this case,
> the mapped device will check if the underlying dm target device
> supports IO polling.
> 
> Signed-off-by: Jeffle Xu <[email protected]>
> ---
>  block/blk-core.c      |   8 ++-
>  drivers/md/dm-core.h  |  21 +++++++
>  drivers/md/dm-table.c | 127 ++++++++++++++++++++++++++++++++++++++++++
>  drivers/md/dm.c       |  37 ++++++++++++
>  4 files changed, 192 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3d93aaa9a49b..d81a5f0faa1d 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1150,7 +1150,13 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  	struct blk_mq_hw_ctx *hctx = NULL;
>  	struct gendisk *disk = NULL;
>  
> -	if (!blk_qc_t_valid(cookie) ||
> +	/*
> +	 * In case of bio-base polling, the returned cookie is actually that of
> +	 * the last split bio. Thus the returned cookie may be BLK_QC_T_NONE,
> +	 * while the previous split bios have already been submitted and queued
> +	 * into the polling hw queue.
> +	 */
> +	if ((queue_is_mq(q) && !blk_qc_t_valid(cookie)) ||
>  	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
>  		return 0;
>  
> diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
> index 086d293c2b03..5a0391aa5cc7 100644
> --- a/drivers/md/dm-core.h
> +++ b/drivers/md/dm-core.h
> @@ -127,6 +127,24 @@ static inline struct dm_stats *dm_get_stats(struct mapped_device *md)
>  
>  #define DM_TABLE_MAX_DEPTH 16
>  
> +enum target_ctx_type {
> +	TARGET_TYPE_MQ,
> +	TARGET_TYPE_DM,
> +};
> +
> +struct target_ctx {
> +	union {
> +		struct {
> +			struct blk_mq_hw_ctx *hctx;
> +			struct request_queue *q;
> +			atomic_t busy;
> +		};
> +		struct mapped_device *md;
> +	};
> +
> +	int type;
> +};
> +

I suddenly realized that this implementation is somewhat problematic.
This implementation actually buffers the hctx pointers of the
underlying mq devices. This can be problematic when the hctx map of an
underlying mq device is dynamically changed at runtime.

For example, the nvme RESET command can trigger this issue. Users can
send the nvme RESET command while there's already one dm device built
upon this nvme device. In this case, the hctx map of the nvme device
will be reallocated underneath that dm device, and the original old
'struct blk_mq_hw_ctx *hctx' pointers buffered in the dm layer can be
out-of-date.

Maybe we could embed 'atomic_t busy' in 'struct blk_mq_hw_ctx', and the
block layer could export one interface iterating all polling hw queues
of a given device, just like what I did in the very first RFC version
of this patch set:


+#define queue_for_each_poll_hw_ctx(q, hctx, i)			\
+for ((i) = 0; ((q)->tag_set->nr_maps > HCTX_TYPE_POLL) &&	\
+     (i) < (q)->tag_set->map[HCTX_TYPE_POLL].nr_queues &&	\
+     ({ hctx = (q)->queue_hw_ctx[((q)->tag_set->map[HCTX_TYPE_POLL].queue_offset + (i))]; 1; }); \
+(i)++)
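
With the 'busy' flag embedded in 'struct blk_mq_hw_ctx' as suggested
above, the dm side could then drop the buffered pointers and iterate
the live hctx map directly, e.g. (rough, untested sketch):

	queue_for_each_poll_hw_ctx(q, hctx, i) {
		if (!atomic_cmpxchg(&hctx->busy, 0, 1)) {
			ret += blk_mq_poll(q, hctx);
			atomic_set(&hctx->busy, 0);
		}
	}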



>  struct dm_table {
>  	struct mapped_device *md;
>  	enum dm_queue_mode type;
> @@ -162,6 +180,9 @@ struct dm_table {
>  	void *event_context;
>  
>  	struct dm_md_mempools *mempools;
> +
> +	int num_ctx;
> +	struct target_ctx *ctxs;
>  };
>  
>  static inline struct completion *dm_get_completion_from_kobject(struct kobject *kobj)
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 188f41287f18..397bb5f57626 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -215,6 +215,8 @@ void dm_table_destroy(struct dm_table *t)
>  
>  	dm_free_md_mempools(t->mempools);
>  
> +	kfree(t->ctxs);
> +
>  	kfree(t);
>  }
>  
> @@ -1194,6 +1196,114 @@ static int dm_table_register_integrity(struct dm_table *t)
>  	return 0;
>  }
>  
> +static int device_supports_poll(struct dm_target *ti, struct dm_dev *dev,
> +				sector_t start, sector_t len, void *data)
> +{
> +	struct request_queue *q = bdev_get_queue(dev->bdev);
> +
> +	return q && test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
> +}
> +
> +static bool dm_table_supports_poll(struct dm_table *t)
> +{
> +	struct dm_target *ti;
> +	unsigned int i;
> +
> +	/* Ensure that all targets support iopoll. */
> +	for (i = 0; i < dm_table_get_num_targets(t); i++) {
> +		ti = dm_table_get_target(t, i);
> +
> +		if (!ti->type->iterate_devices ||
> +		    !ti->type->iterate_devices(ti, device_supports_poll, NULL))
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> +static int dm_table_calc_target_ctxs(struct dm_target *ti,
> +		struct dm_dev *dev,
> +		sector_t start, sector_t len,
> +		void *data)
> +{
> +	int *num = data;
> +	struct request_queue *q = dev->bdev->bd_disk->queue;
> +
> +	if (queue_is_mq(q))
> +		*num += q->tag_set->map[HCTX_TYPE_POLL].nr_queues;
> +	else
> +		*num += 1;
> +
> +	return 0;
> +}
> +
> +static int dm_table_fill_target_ctxs(struct dm_target *ti,
> +		struct dm_dev *dev,
> +		sector_t start, sector_t len,
> +		void *data)
> +{
> +	int *index = data;
> +	struct target_ctx *ctx;
> +	struct request_queue *q = dev->bdev->bd_disk->queue;
> +
> +	if (queue_is_mq(q)) {
> +		int i;
> +		int num = q->tag_set->map[HCTX_TYPE_POLL].nr_queues;
> +		int offset = q->tag_set->map[HCTX_TYPE_POLL].queue_offset;
> +
> +		for (i = 0; i < num; i++) {
> +			ctx = &ti->table->ctxs[(*index)++];
> +			ctx->q = q;
> +			ctx->hctx = q->queue_hw_ctx[offset + i];
> +			ctx->type = TARGET_TYPE_MQ;
> +			/* ctx->busy has been initialized to zero */
> +		}
> +	} else {
> +		struct mapped_device *md = dev->bdev->bd_disk->private_data;
> +
> +		ctx = &ti->table->ctxs[(*index)++];
> +		ctx->md = md;
> +		ctx->type = TARGET_TYPE_DM;
> +	}
> +
> +	return 0;
> +}
> +
> +static int dm_table_build_target_ctxs(struct dm_table *t)
> +{
> +	int i, num = 0, index = 0;
> +
> +	if (!__table_type_bio_based(t->type) || !dm_table_supports_poll(t))
> +		return 0;
> +
> +	for (i = 0; i < t->num_targets; i++) {
> +		struct dm_target *ti = dm_table_get_target(t, i);
> +
> +		if (ti->type->iterate_devices)
> +			ti->type->iterate_devices(ti, dm_table_calc_target_ctxs,
> +					&num);
> +	}
> +
> +	if (WARN_ON(!num))
> +		return 0;
> +
> +	t->num_ctx = num;
> +
> +	t->ctxs = kcalloc(num, sizeof(struct target_ctx), GFP_KERNEL);
> +	if (!t->ctxs)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < t->num_targets; i++) {
> +		struct dm_target *ti = dm_table_get_target(t, i);
> +
> +		if (ti->type->iterate_devices)
> +			ti->type->iterate_devices(ti, dm_table_fill_target_ctxs,
> +					&index);
> +	}
> +
> +	return 0;
> +}
> +
>  /*
>   * Prepares the table for use by building the indices,
>   * setting the type, and allocating mempools.
> @@ -1224,6 +1334,10 @@ int dm_table_complete(struct dm_table *t)
>  	if (r)
>  		DMERR("unable to allocate mempools");
>  
> +	r = dm_table_build_target_ctxs(t);
> +	if (r)
> +		DMERR("unable to build target hctxs");
> +
>  	return r;
>  }
>  
> @@ -1883,6 +1997,19 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
>  #endif
>  
>  	blk_queue_update_readahead(q);
> +
> +	/*
> +	 * Check for request-based device is remained to
> +	 * dm_mq_init_request_queue()->blk_mq_init_allocated_queue().
> +	 * Also clear previously set QUEUE_FLAG_POLL* if the new table doesn't
> +	 * support iopoll while reloading table.
> +	 */
> +	if (__table_type_bio_based(t->type)) {
> +		if (t->ctxs)
> +			q->queue_flags |= QUEUE_FLAG_POLL_MASK;
> +		else
> +			q->queue_flags &= ~QUEUE_FLAG_POLL_MASK;
> +	}
>  }
>  
>  unsigned int dm_table_get_num_targets(struct dm_table *t)
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 46ca3b739396..e2be1caa086a 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1657,6 +1657,42 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
>  	return BLK_QC_T_NONE;
>  }
>  
> +
> +static int dm_poll_one_md(struct mapped_device *md)
> +{
> +	int i, num_ctx, srcu_idx, ret = 0;
> +	struct dm_table *t;
> +	struct target_ctx *ctxs;
> +
> +	t = dm_get_live_table(md, &srcu_idx);
> +	num_ctx = t->num_ctx;
> +	ctxs = t->ctxs;
> +
> +	for (i = 0; i < num_ctx; i++) {
> +		struct target_ctx *ctx = &ctxs[i];
> +
> +		if (ctx->type == TARGET_TYPE_MQ) {
> +			if (!atomic_cmpxchg(&ctx->busy, 0, 1)) {
> +				ret += blk_mq_poll(ctx->q, ctx->hctx);
> +				atomic_set(&ctx->busy, 0);
> +			}
> +		} else
> +			ret += dm_poll_one_md(ctx->md);
> +	}
> +
> +	dm_put_live_table(md, srcu_idx);
> +
> +	return ret;
> +}
> +
> +static int dm_bio_poll(struct request_queue *q, blk_qc_t cookie)
> +{
> +	struct gendisk *disk = queue_to_disk(q);
> +	struct mapped_device *md= disk->private_data;
> +
> +	return dm_poll_one_md(md);
> +}
> +
>  /*-----------------------------------------------------------------
>   * An IDR is used to keep track of allocated minor numbers.
>   *---------------------------------------------------------------*/
> @@ -3049,6 +3085,7 @@ static const struct pr_ops dm_pr_ops = {
>  };
>  
>  static const struct block_device_operations dm_blk_dops = {
> +	.iopoll = dm_bio_poll,
>  	.submit_bio = dm_submit_bio,
>  	.open = dm_blk_open,
>  	.release = dm_blk_close,
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-01-29  7:39 UTC | newest]

Thread overview: 17+ messages
2021-01-25 12:13 [PATCH v2 0/6] dm: support IO polling for bio-based dm device Jeffle Xu
2021-01-25 12:13 ` [PATCH v2 1/6] block: move definition of blk_qc_t to types.h Jeffle Xu
2021-01-25 12:13 ` [PATCH v2 2/6] block: add queue_to_disk() to get gendisk from request_queue Jeffle Xu
2021-01-27 17:20   ` Mike Snitzer
2021-01-25 12:13 ` [PATCH v2 3/6] block: add iopoll method to support bio-based IO polling Jeffle Xu
2021-01-27 17:14   ` Mike Snitzer
2021-01-28  8:40   ` Christoph Hellwig
2021-01-28 11:52     ` JeffleXu
2021-01-28 14:36       ` Christoph Hellwig
2021-01-25 12:13 ` [PATCH v2 4/6] dm: always return BLK_QC_T_NONE for bio-based device Jeffle Xu
2021-01-25 12:13 ` [PATCH v2 5/6] block: add QUEUE_FLAG_POLL_CAP flag Jeffle Xu
2021-01-27 17:13   ` Mike Snitzer
2021-01-28  2:07     ` JeffleXu
2021-01-25 12:13 ` [PATCH v2 6/6] dm: support IO polling for bio-based dm device Jeffle Xu
2021-01-29  7:37   ` JeffleXu
2021-01-27 17:19 ` [PATCH v2 0/6] " Mike Snitzer
2021-01-28  3:06   ` JeffleXu
