* [PATCH v3 00/11] dm: support IO polling
From: Jeffle Xu @ 2021-02-08 8:52 UTC (permalink / raw)
To: snitzer, axboe; +Cc: joseph.qi, caspar, hch, linux-block, dm-devel, io-uring
[Performance]
1. One thread (numjobs=1) randread (bs=4k, direct=1) one dm-linear
device, which is built upon 3 nvme devices, with one polling hw
queue per nvme device.
| IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
---- | --------------- | -------------------- | ----
dm | 208k | 279k | ~34%
2. Three threads (numjobs=3) randread (bs=4k, direct=1) one dm-linear
device, which is built upon 3 nvme devices, with one polling hw
queue per nvme device.
This is compared to 3 threads directly randreading 3 nvme devices, with
one polling hw queue per nvme device. No CPU affinity is set for these 3
threads, so every thread can access every nvme device
(filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1), i.e., every thread
needs to compete for every polling hw queue.
| IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
---- | --------------- | -------------------- | ----
dm | 615k | 728k | ~18%
nvme | 728k | 873k | ~20%
The result shows that the performance gain of bio-based polling is
comparable to that of mq polling in the same test case.
3. Three threads (numjobs=3) randread (bs=12k, direct=1) one
**dm-stripe** device, which is built upon 3 nvme devices, with one
polling hw queue per nvme device.
This is compared to 3 threads directly randreading 3 nvme devices, with
one polling hw queue per nvme device. No CPU affinity is set for these 3
threads, so every thread can access every nvme device
(filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1), i.e., every thread
needs to compete for every polling hw queue.
| IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
---- | --------------- | -------------------- | ----
dm | 314k | 354k | ~13%
nvme | 728k | 873k | ~20%
4. This patchset shall do no harm to the performance of the original mq
polling. The following are the test results of one thread (numjobs=1)
randread (bs=4k, direct=1) on one nvme device.
| IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
---------------- | --------------- | -------------------- | ----
without patchset | 242k | 332k | ~39%
with patchset | 236k | 332k | ~39%
[Changes since v2]
Patchset v2 caches all hw queues (in polling mode) of the underlying mq
devices in the dm layer, and the polling routine iterates through all
these cached hw queues.
However, mq may change the queue mapping at runtime (e.g., on an NVMe
RESET command), so the hw queues cached in the dm layer may become
out-of-date. Thus patchset v3 falls back to the implementation of the
very first RFC version, in which the mq layer exports interfaces for
iterating and polling the polling hw queues (patches 5 and 6), and the
bio-based polling routine just calls these interfaces to iterate all
polling hw queues.
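In rough outline, the resulting dm polling path looks as follows (a
trimmed sketch of the code added in patch 9; dm_poll_one_dev() there
walks the underlying devices and polls their polling hw queues via the
helpers from patches 5 and 6):

static int dm_poll_one_md(struct mapped_device *md)
{
	int i, srcu_idx, ret = 0;
	struct dm_table *t;
	struct dm_target *ti;

	t = dm_get_live_table(md, &srcu_idx);

	/* every target walks its underlying devices and polls their hw queues */
	for (i = 0; i < dm_table_get_num_targets(t); i++) {
		ti = dm_table_get_target(t, i);
		ti->type->iterate_devices(ti, dm_poll_one_dev, &ret);
	}

	dm_put_live_table(md, srcu_idx);
	return ret;
}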
Besides, several new optimizations are proposed.
- patch 1,2,7
same as v2, untouched
- patch 3
Following advice from Christoph Hellwig, the mq and bio-based polling
routines are split from the very beginning while refactoring blk_poll().
Now blk_poll() is just a simple entry point. blk_bio_poll() is simply
copied from blk_mq_poll(), though the loop structure is somewhat
duplicated.
- patch 4
This patch is newly added to support turning polling on/off dynamically
through '/sys/block/<dev>/queue/io_poll' for bio-based devices.
Patchset v2 implemented this functionality by adding one new queue flag,
which is not preferred since queue flags are quite a scarce resource
nowadays.
- patch 5
This patch is newly added, preparing for the following bio-based
polling. The bio-based polling routine will call this helper function,
which also does the polling statistics accounting for the corresponding
hw queue.
- patch 6
It's from the very first RFC version, preparing for the following
bio-based polling.
- patch 8
One fix needed by the following bio-based polling. It's actually a v2
of [1]. I had sent that v2 separately in reply to [1], though it may not
be visible on the mailing list yet, maybe due to archive delay.
- patch 9
It's from the very first RFC version.
- patch 10
This patch is newly added. Patchset v2 had proposed one optimization of
skipping the **busy** hw queues during the iteration phase. Back then,
one 'atomic_t' flag was specifically maintained in the dm layer,
representing whether the corresponding hw queue is busy or not. The idea
is inherited, while the implementation changes. Now @nvmeq->cq_poll_lock
is used directly here, with no need for an extra flag anymore.
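Concretely, patch 10 turns the unconditional lock in nvme_poll() into a
trylock (excerpt, comment added here):

	if (!spin_trylock(&nvmeq->cq_poll_lock))
		return 0;	/* another instance is already reaping this queue */

	found = nvme_process_cq(nvmeq);
	spin_unlock(&nvmeq->cq_poll_lock);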
This optimization can significantly reduce the competition for one hw
queue between multiple polling instances. The following statistics are
the test results when 3 threads concurrently randread (bs=4k, direct=1)
one dm-linear device, which is built upon 3 nvme devices, with one
polling hw queue per nvme device.
| IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
----------- | --------------- | -------------------- | ----
without opt | 318k | 256k | ~-20%
with opt | 314k | 354k | ~13%
- patch 11
This is another newly added optimization for bio-based polling.
One intuitive insight is that, when the original bio submitted to the dm
device doesn't get split, the bio gets enqueued into only one hw queue
of one of the underlying mq devices. In this case, we no longer need to
track all split bios, and one cookie (for the only split bio) is enough.
This is implemented by returning the pointer to the corresponding hw
queue as the cookie in this case.
It should be safe to directly return the pointer to the hw queue, since
'struct blk_mq_hw_ctx' won't be freed during the whole lifetime of
'struct request_queue'. Even though the number of hw queues may decrease
when an NVMe RESET happens, the 'struct blk_mq_hw_ctx' structures of the
removed hw queues won't be freed; instead they are kept on the
&q->unused_hctx_list list.
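In code, the fastpath boils down to the following (condensed from the
patch 11 hunks):

	/* submission side: the bio was not split, encode the hw queue itself */
	cookie = (blk_qc_t)hctx;	/* blk_qc_t is widened to uintptr_t */

	/* polling side: decode the cookie and poll that hw queue directly */
	if (blk_qc_t_bio_valid(cookie)) {
		struct blk_mq_hw_ctx *hctx = (struct blk_mq_hw_ctx *)cookie;

		ret = blk_mq_poll_hctx(hctx->queue, hctx);
	} else {
		ret = disk->fops->poll(q, cookie);	/* fall back to full iteration */
	}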
Though this optimization seems quite intuitive, the performance test
shows that it brings neither benefit nor harm to the performance, when 3
threads concurrently randread (bs=4k, direct=1) one dm-linear device,
which is built upon 3 nvme devices, with one polling hw queue per nvme
device.
I'm not sure why it doesn't help; maybe the number of devices, or the
depth of the device stack, is too low in my test case?
[Remaining Issue]
As mentioned in patch 4, users could change the state of the underlying
devices through '/sys/block/<dev>/queue/io_poll', bypassing the dm
device above. This can cause a situation where QUEUE_FLAG_POLL is still
set for the request_queue of the dm device, while one of the underlying
mq devices may have cleared this flag.
In this case, the 'test_bit(QUEUE_FLAG_POLL, &q->queue_flags)' check in
blk_poll() will pass, while the input cookie may actually point to a hw
queue in IRQ mode since patch 11. For this hw queue (in IRQ mode), the
bio-based polling routine will then poll it while holding
'spin_lock(&nvmeq->cq_poll_lock)' (see
drivers/nvme/host/pci.c:nvme_poll), which is not adequate since this hw
queue may also be accessed in IRQ context. In other words,
spin_lock_irq() should be used here.
I have not come up with a simple way to fix it. I don't want to do a
sanity check (e.g., whether the type of the hw queue is HCTX_TYPE_POLL
or not) in the IO path (submit_bio()/blk_poll()), i.e., the fast path.
We had better fix it in the control path, i.e., dm could be made aware
of the change when an attribute (e.g., io_poll support) of one of the
underlying devices changes at runtime.
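For illustration only, the kind of fast-path check I would like to avoid
would look roughly like this (hypothetical, not part of this series):

	/* hypothetical guard in blk_bio_poll(), NOT part of this series */
	if (blk_qc_t_bio_valid(cookie)) {
		struct blk_mq_hw_ctx *hctx = (struct blk_mq_hw_ctx *)cookie;

		if (hctx->type != HCTX_TYPE_POLL)
			return 0;	/* IRQ-mode hw queue, don't poll it */
	}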
[1] https://listman.redhat.com/archives/dm-devel/2021-February/msg00028.html
[Changes since v1]
- patches 1, 2 and 4 are the same as in v1 and have already been reviewed
- patch 3 is refactored a bit on the basis of suggestions from
Mike Snitzer.
- patch 5 is newly added and introduces one new queue flag
representing whether the queue is capable of IO polling. This mainly
simplifies the logic in queue_poll_store().
- patch 6 implements the core mechanism supporting IO polling.
The sanity check verifying whether the dm device supports IO polling is
also folded into this patch, and the queue flag will be cleared if it
doesn't support polling, in case of table reloading.
Jeffle Xu (11):
block: move definition of blk_qc_t to types.h
block: add queue_to_disk() to get gendisk from request_queue
block: add poll method to support bio-based IO polling
block: add poll_capable method to support bio-based IO polling
block/mq: extract one helper function polling hw queue
block/mq: add iterator for polling hw queues
dm: always return BLK_QC_T_NONE for bio-based device
dm: fix iterate_device sanity check
dm: support IO polling for bio-based dm device
nvme/pci: don't wait for locked polling queue
dm: fastpath of bio-based polling
block/blk-core.c | 89 +++++++++++++-
block/blk-mq.c | 27 +----
block/blk-sysfs.c | 14 ++-
drivers/md/dm-table.c | 215 ++++++++++++++++++----------------
drivers/md/dm.c | 90 +++++++++++---
drivers/md/dm.h | 2 +-
drivers/nvme/host/pci.c | 4 +-
include/linux/blk-mq.h | 21 ++++
include/linux/blk_types.h | 10 +-
include/linux/blkdev.h | 4 +
include/linux/device-mapper.h | 1 +
include/linux/fs.h | 2 +-
include/linux/types.h | 3 +
include/trace/events/kyber.h | 6 +-
14 files changed, 333 insertions(+), 155 deletions(-)
--
2.27.0
* [PATCH v3 01/11] block: move definition of blk_qc_t to types.h
From: Jeffle Xu @ 2021-02-08 8:52 UTC (permalink / raw)
To: snitzer, axboe; +Cc: joseph.qi, caspar, hch, linux-block, dm-devel, io-uring
So that kiocb.ki_cookie can be defined as blk_qc_t, which will enforce
the encapsulation.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
---
include/linux/blk_types.h | 2 +-
include/linux/fs.h | 2 +-
include/linux/types.h | 3 +++
3 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 866f74261b3b..2e05244fc16d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -532,7 +532,7 @@ static inline int op_stat_group(unsigned int op)
return op_is_write(op);
}
-typedef unsigned int blk_qc_t;
+/* Macros for blk_qc_t */
#define BLK_QC_T_NONE -1U
#define BLK_QC_T_SHIFT 16
#define BLK_QC_T_INTERNAL (1U << 31)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd47deea7c17..04b687150736 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -330,7 +330,7 @@ struct kiocb {
u16 ki_hint;
u16 ki_ioprio; /* See linux/ioprio.h */
union {
- unsigned int ki_cookie; /* for ->iopoll */
+ blk_qc_t ki_cookie; /* for ->iopoll */
struct wait_page_queue *ki_waitq; /* for async buffered IO */
};
diff --git a/include/linux/types.h b/include/linux/types.h
index a147977602b5..da5ca7e1bea9 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -125,6 +125,9 @@ typedef s64 int64_t;
typedef u64 sector_t;
typedef u64 blkcnt_t;
+/* cookie used for IO polling */
+typedef unsigned int blk_qc_t;
+
/*
* The type of an index into the pagecache.
*/
--
2.27.0
* [PATCH v3 02/11] block: add queue_to_disk() to get gendisk from request_queue
From: Jeffle Xu @ 2021-02-08 8:52 UTC (permalink / raw)
To: snitzer, axboe; +Cc: joseph.qi, caspar, hch, linux-block, dm-devel, io-uring
Sometimes we need to get the corresponding gendisk from a request_queue.
It is preferred that block drivers store private data in
gendisk->private_data rather than request_queue->queuedata, e.g. see:
commit c4a59c4e5db3 ("dm: stop using ->queuedata").
So if only the request_queue is given, we need to get its corresponding
gendisk in order to reach the private data stored in that gendisk.
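For reference, patch 9 of this series uses the helper exactly this way
to reach the dm private data from a bare request_queue:

	/* excerpt from dm_bio_poll() in patch 9 */
	struct gendisk *disk = queue_to_disk(q);
	struct mapped_device *md = disk->private_data;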
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
---
include/linux/blkdev.h | 2 ++
include/trace/events/kyber.h | 6 +++---
2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f94ee3089e01..2a802e011a95 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -687,6 +687,8 @@ static inline bool blk_account_rq(struct request *rq)
dma_map_page_attrs(dev, (bv)->bv_page, (bv)->bv_offset, (bv)->bv_len, \
(dir), (attrs))
+#define queue_to_disk(q) (dev_to_disk(kobj_to_dev((q)->kobj.parent)))
+
static inline bool queue_is_mq(struct request_queue *q)
{
return q->mq_ops;
diff --git a/include/trace/events/kyber.h b/include/trace/events/kyber.h
index c0e7d24ca256..f9802562edf6 100644
--- a/include/trace/events/kyber.h
+++ b/include/trace/events/kyber.h
@@ -30,7 +30,7 @@ TRACE_EVENT(kyber_latency,
),
TP_fast_assign(
- __entry->dev = disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+ __entry->dev = disk_devt(queue_to_disk(q));
strlcpy(__entry->domain, domain, sizeof(__entry->domain));
strlcpy(__entry->type, type, sizeof(__entry->type));
__entry->percentile = percentile;
@@ -59,7 +59,7 @@ TRACE_EVENT(kyber_adjust,
),
TP_fast_assign(
- __entry->dev = disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+ __entry->dev = disk_devt(queue_to_disk(q));
strlcpy(__entry->domain, domain, sizeof(__entry->domain));
__entry->depth = depth;
),
@@ -81,7 +81,7 @@ TRACE_EVENT(kyber_throttled,
),
TP_fast_assign(
- __entry->dev = disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+ __entry->dev = disk_devt(queue_to_disk(q));
strlcpy(__entry->domain, domain, sizeof(__entry->domain));
),
--
2.27.0
* [PATCH v3 03/11] block: add poll method to support bio-based IO polling
From: Jeffle Xu @ 2021-02-08 8:52 UTC (permalink / raw)
To: snitzer, axboe; +Cc: joseph.qi, caspar, hch, linux-block, dm-devel, io-uring
->poll_fn was introduced in commit ea435e1b9392 ("block: add a poll_fn
callback to struct request_queue") to support bio-based queues such as
nvme multipath, but was later removed in commit 529262d56dbe ("block:
remove ->poll_fn").
Given commit c62b37d96b6e ("block: move ->make_request_fn to struct
block_device_operations"), restore the possibility of bio-based IO
polling support by adding a ->poll method to gendisk->fops.
blk_mq_poll() is now specific to blk-mq, while blk_bio_poll(),
originally a copy of blk_mq_poll(), is specific to bio-based polling.
Signed-off-by: Jeffle Xu <[email protected]>
---
block/blk-core.c | 58 ++++++++++++++++++++++++++++++++++++++++++
block/blk-mq.c | 22 +---------------
include/linux/blk-mq.h | 1 +
include/linux/blkdev.h | 1 +
4 files changed, 61 insertions(+), 21 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 7663a9b94b80..37aa513da5f2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1132,6 +1132,64 @@ blk_qc_t submit_bio(struct bio *bio)
}
EXPORT_SYMBOL(submit_bio);
+
+static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
+{
+ long state;
+ struct gendisk *disk = queue_to_disk(q);
+
+ state = current->state;
+ do {
+ int ret;
+
+ ret = disk->fops->poll(q, cookie);
+ if (ret > 0) {
+ __set_current_state(TASK_RUNNING);
+ return ret;
+ }
+
+ if (signal_pending_state(state, current))
+ __set_current_state(TASK_RUNNING);
+
+ if (current->state == TASK_RUNNING)
+ return 1;
+ if (ret < 0 || !spin)
+ break;
+ cpu_relax();
+ } while (!need_resched());
+
+ __set_current_state(TASK_RUNNING);
+ return 0;
+}
+
+/**
+ * blk_poll - poll for IO completions
+ * @q: the queue
+ * @cookie: cookie passed back at IO submission time
+ * @spin: whether to spin for completions
+ *
+ * Description:
+ * Poll for completions on the passed in queue. Returns number of
+ * completed entries found. If @spin is true, then blk_poll will continue
+ * looping until at least one completion is found, unless the task is
+ * otherwise marked running (or we need to reschedule).
+ */
+int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
+{
+ if (!blk_qc_t_valid(cookie) ||
+ !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+ return 0;
+
+ if (current->plug)
+ blk_flush_plug_list(current->plug, false);
+
+ if (queue_is_mq(q))
+ return blk_mq_poll(q, cookie, spin);
+ else
+ return blk_bio_poll(q, cookie, spin);
+}
+EXPORT_SYMBOL_GPL(blk_poll);
+
/**
* blk_cloned_rq_check_limits - Helper function to check a cloned request
* for the new queue limits
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f285a9123a8b..c18bc23834a0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3826,30 +3826,11 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
return blk_mq_poll_hybrid_sleep(q, rq);
}
-/**
- * blk_poll - poll for IO completions
- * @q: the queue
- * @cookie: cookie passed back at IO submission time
- * @spin: whether to spin for completions
- *
- * Description:
- * Poll for completions on the passed in queue. Returns number of
- * completed entries found. If @spin is true, then blk_poll will continue
- * looping until at least one completion is found, unless the task is
- * otherwise marked running (or we need to reschedule).
- */
-int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
+int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
{
struct blk_mq_hw_ctx *hctx;
long state;
- if (!blk_qc_t_valid(cookie) ||
- !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
- return 0;
-
- if (current->plug)
- blk_flush_plug_list(current->plug, false);
-
hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
/*
@@ -3891,7 +3872,6 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
__set_current_state(TASK_RUNNING);
return 0;
}
-EXPORT_SYMBOL_GPL(blk_poll);
unsigned int blk_mq_rq_cpu(struct request *rq)
{
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d705b174d346..adc789d5888e 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -607,6 +607,7 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
}
blk_qc_t blk_mq_submit_bio(struct bio *bio);
+int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, bool spin);
void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx,
struct lock_class_key *key);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2a802e011a95..ebe16f55cba4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1853,6 +1853,7 @@ static inline void blk_ksm_unregister(struct request_queue *q) { }
struct block_device_operations {
blk_qc_t (*submit_bio) (struct bio *bio);
+ int (*poll)(struct request_queue *q, blk_qc_t cookie);
int (*open) (struct block_device *, fmode_t);
void (*release) (struct gendisk *, fmode_t);
int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
--
2.27.0
* [PATCH v3 04/11] block: add poll_capable method to support bio-based IO polling
From: Jeffle Xu @ 2021-02-08 8:52 UTC (permalink / raw)
To: snitzer, axboe; +Cc: joseph.qi, caspar, hch, linux-block, dm-devel, io-uring
This method can be used to check whether a bio-based device supports IO
polling or not. For mq devices, checking for hw queues in polling mode
is adequate, while the sanity check shall be implementation specific for
bio-based devices. For example, a dm device needs to check if all
underlying devices are capable of IO polling.
Though a bio-based device may have done the sanity check during the
device initialization phase, caching the result of this sanity check
(such as by caching it in the queue_flags) may not work. For dm devices,
users could change the state of the underlying devices through
'/sys/block/<dev>/queue/io_poll', bypassing the dm device above. In this
case, the cached result of the initial sanity check could be
out-of-date. Thus the sanity check needs to be done every time 'io_poll'
is to be modified.
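As a concrete example, the dm implementation added later in this series
(patch 9) takes the following shape (excerpt, comment added here):

static bool dm_bio_poll_capable(struct gendisk *disk)
{
	int ret, srcu_idx;
	struct mapped_device *md = disk->private_data;
	struct dm_table *t;

	t = dm_get_live_table(md, &srcu_idx);
	/* true only if all underlying devices currently support polling */
	ret = dm_table_supports_poll(t);
	dm_put_live_table(md, srcu_idx);

	return ret;
}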
Signed-off-by: Jeffle Xu <[email protected]>
---
block/blk-sysfs.c | 14 +++++++++++---
include/linux/blkdev.h | 1 +
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index b513f1683af0..f11fedefc37d 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -420,9 +420,17 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
unsigned long poll_on;
ssize_t ret;
- if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL ||
- !q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
- return -EINVAL;
+ if (queue_is_mq(q)) {
+ if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL ||
+ !q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
+ return -EINVAL;
+ } else {
+ struct gendisk *disk = queue_to_disk(q);
+
+ if (!disk->fops->poll_capable ||
+ !disk->fops->poll_capable(disk))
+ return -EINVAL;
+ }
ret = queue_var_store(&poll_on, page, count);
if (ret < 0)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ebe16f55cba4..8a84088642ce 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1854,6 +1854,7 @@ static inline void blk_ksm_unregister(struct request_queue *q) { }
struct block_device_operations {
blk_qc_t (*submit_bio) (struct bio *bio);
int (*poll)(struct request_queue *q, blk_qc_t cookie);
+ bool (*poll_capable)(struct gendisk *disk);
int (*open) (struct block_device *, fmode_t);
void (*release) (struct gendisk *, fmode_t);
int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
--
2.27.0
* [PATCH v3 05/11] block/mq: extract one helper function polling hw queue
From: Jeffle Xu @ 2021-02-08 8:52 UTC (permalink / raw)
To: snitzer, axboe; +Cc: joseph.qi, caspar, hch, linux-block, dm-devel, io-uring
Extract the logic of polling one hw queue and the related statistics
handling out into a helper function.
Signed-off-by: Jeffle Xu <[email protected]>
---
block/blk-mq.c | 5 +----
include/linux/blk-mq.h | 13 +++++++++++++
2 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c18bc23834a0..b4de2b37b826 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3850,11 +3850,8 @@ int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
do {
int ret;
- hctx->poll_invoked++;
-
- ret = q->mq_ops->poll(hctx);
+ ret = blk_mq_poll_hctx(q, hctx);
if (ret > 0) {
- hctx->poll_success++;
__set_current_state(TASK_RUNNING);
return ret;
}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index adc789d5888e..cf1910c6d5ae 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -606,6 +606,19 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
rq->rq_disk = bio->bi_disk;
}
+static inline int blk_mq_poll_hctx(struct request_queue *q,
+ struct blk_mq_hw_ctx *hctx)
+{
+ int ret;
+
+ hctx->poll_invoked++;
+ ret = q->mq_ops->poll(hctx);
+ if (ret > 0)
+ hctx->poll_success++;
+
+ return ret;
+}
+
blk_qc_t blk_mq_submit_bio(struct bio *bio);
int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, bool spin);
void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx,
--
2.27.0
* [PATCH v3 06/11] block/mq: add iterator for polling hw queues
From: Jeffle Xu @ 2021-02-08 8:52 UTC (permalink / raw)
To: snitzer, axboe; +Cc: joseph.qi, caspar, hch, linux-block, dm-devel, io-uring
Add one helper macro for iterating over all hardware queues in polling
mode.
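A typical caller then looks roughly like this (usage sketch; cf.
dm_poll_one_dev() in patch 9):

	struct blk_mq_hw_ctx *hctx;
	int i, found = 0;

	/* poll every polling hw queue of the mq device backing @q */
	queue_for_each_poll_hw_ctx(q, hctx, i)
		found += blk_mq_poll_hctx(q, hctx);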
Signed-off-by: Jeffle Xu <[email protected]>
---
include/linux/blk-mq.h | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index cf1910c6d5ae..0c511555ec16 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -574,6 +574,13 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq)
for ((i) = 0; (i) < (q)->nr_hw_queues && \
({ hctx = (q)->queue_hw_ctx[i]; 1; }); (i)++)
+#define queue_for_each_poll_hw_ctx(q, hctx, i) \
+ for ((i) = 0; ((q)->tag_set->nr_maps > HCTX_TYPE_POLL) && \
+ (i) < (q)->tag_set->map[HCTX_TYPE_POLL].nr_queues && \
+ ({ int __idx = (q)->tag_set->map[HCTX_TYPE_POLL].queue_offset + (i); \
+ hctx = (q)->queue_hw_ctx[__idx]; 1; }); \
+ (i)++)
+
#define hctx_for_each_ctx(hctx, ctx, i) \
for ((i) = 0; (i) < (hctx)->nr_ctx && \
({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
--
2.27.0
* [PATCH v3 07/11] dm: always return BLK_QC_T_NONE for bio-based device
From: Jeffle Xu @ 2021-02-08 8:52 UTC (permalink / raw)
To: snitzer, axboe; +Cc: joseph.qi, caspar, hch, linux-block, dm-devel, io-uring
Currently the cookie returned by bio-based devices is not used at all.
The cookie of bio-based devices will be refactored in a following patch.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
---
drivers/md/dm.c | 26 ++++++++++----------------
1 file changed, 10 insertions(+), 16 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 7bac564f3faa..46ca3b739396 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1252,14 +1252,13 @@ void dm_accept_partial_bio(struct bio *bio, unsigned n_sectors)
}
EXPORT_SYMBOL_GPL(dm_accept_partial_bio);
-static blk_qc_t __map_bio(struct dm_target_io *tio)
+static void __map_bio(struct dm_target_io *tio)
{
int r;
sector_t sector;
struct bio *clone = &tio->clone;
struct dm_io *io = tio->io;
struct dm_target *ti = tio->ti;
- blk_qc_t ret = BLK_QC_T_NONE;
clone->bi_end_io = clone_endio;
@@ -1278,7 +1277,7 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
case DM_MAPIO_REMAPPED:
/* the bio has been remapped so dispatch it */
trace_block_bio_remap(clone, bio_dev(io->orig_bio), sector);
- ret = submit_bio_noacct(clone);
+ submit_bio_noacct(clone);
break;
case DM_MAPIO_KILL:
free_tio(tio);
@@ -1292,8 +1291,6 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
DMWARN("unimplemented target map return value: %d", r);
BUG();
}
-
- return ret;
}
static void bio_setup_sector(struct bio *bio, sector_t sector, unsigned len)
@@ -1380,7 +1377,7 @@ static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci,
}
}
-static blk_qc_t __clone_and_map_simple_bio(struct clone_info *ci,
+static void __clone_and_map_simple_bio(struct clone_info *ci,
struct dm_target_io *tio, unsigned *len)
{
struct bio *clone = &tio->clone;
@@ -1391,7 +1388,7 @@ static blk_qc_t __clone_and_map_simple_bio(struct clone_info *ci,
if (len)
bio_setup_sector(clone, ci->sector, *len);
- return __map_bio(tio);
+ __map_bio(tio);
}
static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
@@ -1405,7 +1402,7 @@ static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
while ((bio = bio_list_pop(&blist))) {
tio = container_of(bio, struct dm_target_io, clone);
- (void) __clone_and_map_simple_bio(ci, tio, len);
+ __clone_and_map_simple_bio(ci, tio, len);
}
}
@@ -1450,7 +1447,7 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
free_tio(tio);
return r;
}
- (void) __map_bio(tio);
+ __map_bio(tio);
return 0;
}
@@ -1565,11 +1562,10 @@ static void init_clone_info(struct clone_info *ci, struct mapped_device *md,
/*
* Entry point to split a bio into clones and submit them to the targets.
*/
-static blk_qc_t __split_and_process_bio(struct mapped_device *md,
+static void __split_and_process_bio(struct mapped_device *md,
struct dm_table *map, struct bio *bio)
{
struct clone_info ci;
- blk_qc_t ret = BLK_QC_T_NONE;
int error = 0;
init_clone_info(&ci, md, map, bio);
@@ -1613,7 +1609,7 @@ static blk_qc_t __split_and_process_bio(struct mapped_device *md,
bio_chain(b, bio);
trace_block_split(b, bio->bi_iter.bi_sector);
- ret = submit_bio_noacct(bio);
+ submit_bio_noacct(bio);
break;
}
}
@@ -1621,13 +1617,11 @@ static blk_qc_t __split_and_process_bio(struct mapped_device *md,
/* drop the extra reference count */
dec_pending(ci.io, errno_to_blk_status(error));
- return ret;
}
static blk_qc_t dm_submit_bio(struct bio *bio)
{
struct mapped_device *md = bio->bi_disk->private_data;
- blk_qc_t ret = BLK_QC_T_NONE;
int srcu_idx;
struct dm_table *map;
@@ -1657,10 +1651,10 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
if (is_abnormal_io(bio))
blk_queue_split(&bio);
- ret = __split_and_process_bio(md, map, bio);
+ __split_and_process_bio(md, map, bio);
out:
dm_put_live_table(md, srcu_idx);
- return ret;
+ return BLK_QC_T_NONE;
}
/*-----------------------------------------------------------------
--
2.27.0
* [PATCH v3 08/11] dm: fix iterate_device sanity check
From: Jeffle Xu @ 2021-02-08 8:52 UTC (permalink / raw)
To: snitzer, axboe; +Cc: joseph.qi, caspar, hch, linux-block, dm-devel, io-uring
According to the definition of dm_iterate_devices_fn:
* This function must iterate through each section of device used by the
* target until it encounters a non-zero return code, which it then returns.
* Returns zero if no callout returned non-zero.
For some target type (e.g., dm-stripe), one call of iterate_devices()
may iterate multiple underlying devices internally, in which case a
non-zero return code returned by iterate_devices_callout_fn will stop
the iteration in advance.
Thus if we want to ensure that _all_ underlying devices support some
kind of attribute, the iteration structure like
dm_table_supports_nowait() should be used, while the input
iterate_devices_callout_fn should handle the 'not support' semantics.
On the opposite, the iteration structure like dm_table_any_dev_attr()
should be used if _any_ underlying device supporting this attibute is
sufficient. In this case, the input iterate_devices_callout_fn should
handle the 'support' semantics.
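Schematically, the converted call sites below then read like this
(excerpted from the hunks that follow, comments added here):

	/* _any_ underlying device requiring stable pages is enough */
	if (dm_table_any_dev_attr(t, device_requires_stable_pages, NULL))
		blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);

	/*
	 * _all_ underlying devices must be non-rotational: the callout reports
	 * the counter example (a rotational device), so a single hit clears
	 * the flag.
	 */
	if (dm_table_any_dev_attr(t, device_is_rot, NULL))
		blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
	else
		blk_queue_flag_set(QUEUE_FLAG_NONROT, q);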
Besides, remove the unnecessary NULL pointer check for request_queue,
since the request_queue pointer returned from bdev_get_queue() shall
never be NULL according to commit ff9ea323816d ("block, bdi: an active
gendisk always has a request_queue associated with it").
Fixes: 545ed20e6df6 ("dm: add infrastructure for DAX support")
Fixes: c3c4555edd10 ("dm table: clear add_random unless all devices have it set")
Fixes: 4693c9668fdc ("dm table: propagate non rotational flag")
Cc: [email protected]
Signed-off-by: Jeffle Xu <[email protected]>
---
drivers/md/dm-table.c | 195 +++++++++++++++++++-----------------------
drivers/md/dm.c | 2 +-
drivers/md/dm.h | 2 +-
3 files changed, 91 insertions(+), 108 deletions(-)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 4acf2342f7ad..aa37f3e82238 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -820,24 +820,24 @@ void dm_table_set_type(struct dm_table *t, enum dm_queue_mode type)
EXPORT_SYMBOL_GPL(dm_table_set_type);
/* validate the dax capability of the target device span */
-int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
- sector_t start, sector_t len, void *data)
+int device_not_dax_capable(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
{
int blocksize = *(int *) data, id;
bool rc;
id = dax_read_lock();
- rc = dax_supported(dev->dax_dev, dev->bdev, blocksize, start, len);
+ rc = !dax_supported(dev->dax_dev, dev->bdev, blocksize, start, len);
dax_read_unlock(id);
return rc;
}
/* Check devices support synchronous DAX */
-static int device_dax_synchronous(struct dm_target *ti, struct dm_dev *dev,
- sector_t start, sector_t len, void *data)
+static int device_not_dax_sync_capable(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
{
- return dev->dax_dev && dax_synchronous(dev->dax_dev);
+ return !dev->dax_dev || !dax_synchronous(dev->dax_dev);
}
bool dm_table_supports_dax(struct dm_table *t,
@@ -854,7 +854,7 @@ bool dm_table_supports_dax(struct dm_table *t,
return false;
if (!ti->type->iterate_devices ||
- !ti->type->iterate_devices(ti, iterate_fn, blocksize))
+ ti->type->iterate_devices(ti, iterate_fn, blocksize))
return false;
}
@@ -925,7 +925,7 @@ static int dm_table_determine_type(struct dm_table *t)
verify_bio_based:
/* We must use this table as bio-based */
t->type = DM_TYPE_BIO_BASED;
- if (dm_table_supports_dax(t, device_supports_dax, &page_size) ||
+ if (dm_table_supports_dax(t, device_not_dax_capable, &page_size) ||
(list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) {
t->type = DM_TYPE_DAX_BIO_BASED;
}
@@ -1331,13 +1331,49 @@ bool dm_table_has_no_data_devices(struct dm_table *table)
return true;
}
-static int device_is_zoned_model(struct dm_target *ti, struct dm_dev *dev,
- sector_t start, sector_t len, void *data)
+/*
+ * Cases requiring _any_ underlying device supporting some kind of attribute,
+ * should use the iteration structure like dm_table_any_dev_attr(), or call
+ * it directly. @func should handle semantics of positive examples, e.g.,
+ * capable of something.
+ */
+static bool dm_table_any_dev_attr(struct dm_table *t,
+ iterate_devices_callout_fn func, void *data)
+{
+ struct dm_target *ti;
+ unsigned int i;
+
+ for (i = 0; i < dm_table_get_num_targets(t); i++) {
+ ti = dm_table_get_target(t, i);
+
+ if (ti->type->iterate_devices &&
+ ti->type->iterate_devices(ti, func, data))
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Cases requiring _all_ underlying devices supporting some kind of attribute,
+ * should use the iteration structure like dm_table_supports_nowait(), or call
+ * dm_table_all_devs_attr() directly. @anti_func should handle semantics of
+ * counter examples, e.g., not capable of something.
+ */
+static inline bool dm_table_all_devs_attr(struct dm_table *t,
+ iterate_devices_callout_fn anti_func,
+ void *data)
+{
+ return !dm_table_any_dev_attr(t, anti_func, data);
+}
+
+static int device_not_zoned_model(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
{
struct request_queue *q = bdev_get_queue(dev->bdev);
enum blk_zoned_model *zoned_model = data;
- return q && blk_queue_zoned_model(q) == *zoned_model;
+ return blk_queue_zoned_model(q) != *zoned_model;
}
static bool dm_table_supports_zoned_model(struct dm_table *t,
@@ -1354,37 +1390,20 @@ static bool dm_table_supports_zoned_model(struct dm_table *t,
return false;
if (!ti->type->iterate_devices ||
- !ti->type->iterate_devices(ti, device_is_zoned_model, &zoned_model))
+ ti->type->iterate_devices(ti, device_not_zoned_model, &zoned_model))
return false;
}
return true;
}
-static int device_matches_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
- sector_t start, sector_t len, void *data)
+static int device_not_matches_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
{
struct request_queue *q = bdev_get_queue(dev->bdev);
unsigned int *zone_sectors = data;
- return q && blk_queue_zone_sectors(q) == *zone_sectors;
-}
-
-static bool dm_table_matches_zone_sectors(struct dm_table *t,
- unsigned int zone_sectors)
-{
- struct dm_target *ti;
- unsigned i;
-
- for (i = 0; i < dm_table_get_num_targets(t); i++) {
- ti = dm_table_get_target(t, i);
-
- if (!ti->type->iterate_devices ||
- !ti->type->iterate_devices(ti, device_matches_zone_sectors, &zone_sectors))
- return false;
- }
-
- return true;
+ return blk_queue_zone_sectors(q) != *zone_sectors;
}
static int validate_hardware_zoned_model(struct dm_table *table,
@@ -1404,7 +1423,7 @@ static int validate_hardware_zoned_model(struct dm_table *table,
if (!zone_sectors || !is_power_of_2(zone_sectors))
return -EINVAL;
- if (!dm_table_matches_zone_sectors(table, zone_sectors)) {
+ if (dm_table_any_dev_attr(table, device_not_matches_zone_sectors, &zone_sectors)) {
DMERR("%s: zone sectors is not consistent across all devices",
dm_device_name(table->md));
return -EINVAL;
@@ -1533,7 +1552,7 @@ static int device_flush_capable(struct dm_target *ti, struct dm_dev *dev,
unsigned long flush = (unsigned long) data;
struct request_queue *q = bdev_get_queue(dev->bdev);
- return q && (q->queue_flags & flush);
+ return (q->queue_flags & flush);
}
static bool dm_table_supports_flush(struct dm_table *t, unsigned long flush)
@@ -1578,54 +1597,20 @@ static int device_dax_write_cache_enabled(struct dm_target *ti,
return false;
}
-static int dm_table_supports_dax_write_cache(struct dm_table *t)
-{
- struct dm_target *ti;
- unsigned i;
-
- for (i = 0; i < dm_table_get_num_targets(t); i++) {
- ti = dm_table_get_target(t, i);
-
- if (ti->type->iterate_devices &&
- ti->type->iterate_devices(ti,
- device_dax_write_cache_enabled, NULL))
- return true;
- }
-
- return false;
-}
-
-static int device_is_nonrot(struct dm_target *ti, struct dm_dev *dev,
- sector_t start, sector_t len, void *data)
+static int device_is_rot(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
{
struct request_queue *q = bdev_get_queue(dev->bdev);
- return q && blk_queue_nonrot(q);
+ return !blk_queue_nonrot(q);
}
static int device_is_not_random(struct dm_target *ti, struct dm_dev *dev,
- sector_t start, sector_t len, void *data)
+ sector_t start, sector_t len, void *data)
{
struct request_queue *q = bdev_get_queue(dev->bdev);
- return q && !blk_queue_add_random(q);
-}
-
-static bool dm_table_all_devices_attribute(struct dm_table *t,
- iterate_devices_callout_fn func)
-{
- struct dm_target *ti;
- unsigned i;
-
- for (i = 0; i < dm_table_get_num_targets(t); i++) {
- ti = dm_table_get_target(t, i);
-
- if (!ti->type->iterate_devices ||
- !ti->type->iterate_devices(ti, func, NULL))
- return false;
- }
-
- return true;
+ return !blk_queue_add_random(q);
}
static int device_not_write_same_capable(struct dm_target *ti, struct dm_dev *dev,
@@ -1633,7 +1618,7 @@ static int device_not_write_same_capable(struct dm_target *ti, struct dm_dev *de
{
struct request_queue *q = bdev_get_queue(dev->bdev);
- return q && !q->limits.max_write_same_sectors;
+ return !q->limits.max_write_same_sectors;
}
static bool dm_table_supports_write_same(struct dm_table *t)
@@ -1660,7 +1645,7 @@ static int device_not_write_zeroes_capable(struct dm_target *ti, struct dm_dev *
{
struct request_queue *q = bdev_get_queue(dev->bdev);
- return q && !q->limits.max_write_zeroes_sectors;
+ return !q->limits.max_write_zeroes_sectors;
}
static bool dm_table_supports_write_zeroes(struct dm_table *t)
@@ -1687,7 +1672,7 @@ static int device_not_nowait_capable(struct dm_target *ti, struct dm_dev *dev,
{
struct request_queue *q = bdev_get_queue(dev->bdev);
- return q && !blk_queue_nowait(q);
+ return !blk_queue_nowait(q);
}
static bool dm_table_supports_nowait(struct dm_table *t)
@@ -1714,7 +1699,7 @@ static int device_not_discard_capable(struct dm_target *ti, struct dm_dev *dev,
{
struct request_queue *q = bdev_get_queue(dev->bdev);
- return q && !blk_queue_discard(q);
+ return !blk_queue_discard(q);
}
static bool dm_table_supports_discards(struct dm_table *t)
@@ -1748,7 +1733,7 @@ static int device_not_secure_erase_capable(struct dm_target *ti,
{
struct request_queue *q = bdev_get_queue(dev->bdev);
- return q && !blk_queue_secure_erase(q);
+ return !blk_queue_secure_erase(q);
}
static bool dm_table_supports_secure_erase(struct dm_table *t)
@@ -1776,30 +1761,23 @@ static int device_requires_stable_pages(struct dm_target *ti,
{
struct request_queue *q = bdev_get_queue(dev->bdev);
- return q && blk_queue_stable_writes(q);
+ return blk_queue_stable_writes(q);
}
/*
- * If any underlying device requires stable pages, a table must require
- * them as well. Only targets that support iterate_devices are considered:
- * don't want error, zero, etc to require stable pages.
+ * type->iterate_devices() should be called when the sanity check needs to
+ * iterate and check all underlying data devices. iterate_devices() will
+ * iterate all underlying data devices until it encounters a non-zero return
+ * code, returned by whether the input iterate_devices_callout_fn, or
+ * iterate_devices() itself internally.
+ *
+ * For some target type (e.g., dm-stripe), one call of iterate_devices() may
+ * iterate multiple underlying devices internally, in which case a non-zero
+ * return code returned by iterate_devices_callout_fn will stop the iteration
+ * in advance.
+ * Thus users should call dm_table_any_dev_attr() or dm_table_all_devs_attr()
+ * directly if possible. See comments of these two functions for more details.
*/
-static bool dm_table_requires_stable_pages(struct dm_table *t)
-{
- struct dm_target *ti;
- unsigned i;
-
- for (i = 0; i < dm_table_get_num_targets(t); i++) {
- ti = dm_table_get_target(t, i);
-
- if (ti->type->iterate_devices &&
- ti->type->iterate_devices(ti, device_requires_stable_pages, NULL))
- return true;
- }
-
- return false;
-}
-
void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
struct queue_limits *limits)
{
@@ -1837,22 +1815,22 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
}
blk_queue_write_cache(q, wc, fua);
- if (dm_table_supports_dax(t, device_supports_dax, &page_size)) {
+ if (dm_table_supports_dax(t, device_not_dax_capable, &page_size)) {
blk_queue_flag_set(QUEUE_FLAG_DAX, q);
- if (dm_table_supports_dax(t, device_dax_synchronous, NULL))
+ if (dm_table_supports_dax(t, device_not_dax_sync_capable, NULL))
set_dax_synchronous(t->md->dax_dev);
}
else
blk_queue_flag_clear(QUEUE_FLAG_DAX, q);
- if (dm_table_supports_dax_write_cache(t))
+ if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
dax_write_cache(t->md->dax_dev, true);
/* Ensure that all underlying devices are non-rotational. */
- if (dm_table_all_devices_attribute(t, device_is_nonrot))
- blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
- else
+ if (dm_table_any_dev_attr(t, device_is_rot, NULL))
blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
+ else
+ blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
if (!dm_table_supports_write_same(t))
q->limits.max_write_same_sectors = 0;
@@ -1864,8 +1842,11 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
/*
* Some devices don't use blk_integrity but still want stable pages
* because they do their own checksumming.
+ * If any underlying device requires stable pages, a table must require
+ * them as well. Only targets that support iterate_devices are considered:
+ * don't want error, zero, etc to require stable pages.
*/
- if (dm_table_requires_stable_pages(t))
+ if (dm_table_any_dev_attr(t, device_requires_stable_pages, NULL))
blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);
else
blk_queue_flag_clear(QUEUE_FLAG_STABLE_WRITES, q);
@@ -1876,8 +1857,10 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
* Clear QUEUE_FLAG_ADD_RANDOM if any underlying device does not
* have it set.
*/
- if (blk_queue_add_random(q) && dm_table_all_devices_attribute(t, device_is_not_random))
+ if (dm_table_any_dev_attr(t, device_is_not_random, NULL))
blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
+ else
+ blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q);
/*
* For a zoned target, the number of zones should be updated for the
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 46ca3b739396..c2945c90745e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1128,7 +1128,7 @@ static bool dm_dax_supported(struct dax_device *dax_dev, struct block_device *bd
if (!map)
goto out;
- ret = dm_table_supports_dax(map, device_supports_dax, &blocksize);
+ ret = dm_table_supports_dax(map, device_not_dax_capable, &blocksize);
out:
dm_put_live_table(md, srcu_idx);
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index fffe1e289c53..b441ad772c18 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -73,7 +73,7 @@ void dm_table_free_md_mempools(struct dm_table *t);
struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t);
bool dm_table_supports_dax(struct dm_table *t, iterate_devices_callout_fn fn,
int *blocksize);
-int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
+int device_not_dax_capable(struct dm_target *ti, struct dm_dev *dev,
sector_t start, sector_t len, void *data);
void dm_lock_md_type(struct mapped_device *md);
--
2.27.0
* [PATCH v3 09/11] dm: support IO polling for bio-based dm device
From: Jeffle Xu @ 2021-02-08 8:52 UTC (permalink / raw)
To: snitzer, axboe; +Cc: joseph.qi, caspar, hch, linux-block, dm-devel, io-uring
DM will iterate and poll all polling hardware queues of all target mq
devices when polling IO for dm device. To mitigate the race introduced
by iterating all target hw queues, a per-hw-queue flag is maintained
to indicate whether this polling hw queue currently being polled on or
not. Every polling hw queue is exclusive to one polling instance, i.e.,
the polling instance will skip this polling hw queue if this hw queue
currently is being polled by another polling instance, and start
polling on the next hw queue.
IO polling is enabled when all underlying target devices are capable
of IO polling. The sanity check supports the stacked device model, in
which one dm device may be built upon another dm device. In this case,
the mapped device will check if the underlying dm target device
supports IO polling.
Signed-off-by: Jeffle Xu <[email protected]>
---
drivers/md/dm-table.c | 26 ++++++++++++++
drivers/md/dm.c | 64 +++++++++++++++++++++++++++++++++++
include/linux/device-mapper.h | 1 +
3 files changed, 91 insertions(+)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index aa37f3e82238..b090b4c9692d 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1764,6 +1764,19 @@ static int device_requires_stable_pages(struct dm_target *ti,
return blk_queue_stable_writes(q);
}
+static int device_not_poll_capable(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
+{
+ struct request_queue *q = bdev_get_queue(dev->bdev);
+
+ return !test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
+}
+
+int dm_table_supports_poll(struct dm_table *t)
+{
+ return dm_table_all_devs_attr(t, device_not_poll_capable, NULL);
+}
+
/*
* type->iterate_devices() should be called when the sanity check needs to
* iterate and check all underlying data devices. iterate_devices() will
@@ -1875,6 +1888,19 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
#endif
blk_queue_update_readahead(q);
+
+ /*
+ * Check for request-based device is remained to
+ * dm_mq_init_request_queue()->blk_mq_init_allocated_queue().
+ * For bio-based device, only set QUEUE_FLAG_POLL when all underlying
+ * devices supporting polling.
+ */
+ if (__table_type_bio_based(t->type)) {
+ if (dm_table_supports_poll(t))
+ blk_queue_flag_set(QUEUE_FLAG_POLL, q);
+ else
+ blk_queue_flag_clear(QUEUE_FLAG_POLL, q);
+ }
}
unsigned int dm_table_get_num_targets(struct dm_table *t)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index c2945c90745e..8423f1347bb8 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1657,6 +1657,68 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
return BLK_QC_T_NONE;
}
+static int dm_poll_one_md(struct mapped_device *md);
+
+static int dm_poll_one_dev(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
+{
+ int i, *count = data;
+ struct request_queue *q = bdev_get_queue(dev->bdev);
+ struct blk_mq_hw_ctx *hctx;
+
+ if (queue_is_mq(q)) {
+ if (!percpu_ref_tryget(&q->q_usage_counter))
+ return 0;
+
+ queue_for_each_poll_hw_ctx(q, hctx, i)
+ *count += blk_mq_poll_hctx(q, hctx);
+
+ percpu_ref_put(&q->q_usage_counter);
+ } else
+ *count += dm_poll_one_md(dev->bdev->bd_disk->private_data);
+
+ return 0;
+}
+
+static int dm_poll_one_md(struct mapped_device *md)
+{
+ int i, srcu_idx, ret = 0;
+ struct dm_table *t;
+ struct dm_target *ti;
+
+ t = dm_get_live_table(md, &srcu_idx);
+
+ for (i = 0; i < dm_table_get_num_targets(t); i++) {
+ ti = dm_table_get_target(t, i);
+ ti->type->iterate_devices(ti, dm_poll_one_dev, &ret);
+ }
+
+ dm_put_live_table(md, srcu_idx);
+
+ return ret;
+}
+
+static int dm_bio_poll(struct request_queue *q, blk_qc_t cookie)
+{
+ struct gendisk *disk = queue_to_disk(q);
+ struct mapped_device *md = disk->private_data;
+
+ return dm_poll_one_md(md);
+}
+
+static bool dm_bio_poll_capable(struct gendisk *disk)
+{
+ int ret, srcu_idx;
+ struct mapped_device *md = disk->private_data;
+ struct dm_table *t;
+
+ t = dm_get_live_table(md, &srcu_idx);
+ ret = dm_table_supports_poll(t);
+ dm_put_live_table(md, srcu_idx);
+
+ return ret;
+}
+
/*-----------------------------------------------------------------
* An IDR is used to keep track of allocated minor numbers.
*---------------------------------------------------------------*/
@@ -3049,6 +3111,8 @@ static const struct pr_ops dm_pr_ops = {
};
static const struct block_device_operations dm_blk_dops = {
+ .poll = dm_bio_poll,
+ .poll_capable = dm_bio_poll_capable,
.submit_bio = dm_submit_bio,
.open = dm_blk_open,
.release = dm_blk_close,
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 61a66fb8ebb3..6a9de3fd0087 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -515,6 +515,7 @@ unsigned int dm_table_get_num_targets(struct dm_table *t);
fmode_t dm_table_get_mode(struct dm_table *t);
struct mapped_device *dm_table_get_md(struct dm_table *t);
const char *dm_table_device_name(struct dm_table *t);
+int dm_table_supports_poll(struct dm_table *t);
/*
* Trigger an event.
--
2.27.0
* [PATCH v3 10/11] nvme/pci: don't wait for locked polling queue
From: Jeffle Xu @ 2021-02-08 8:52 UTC (permalink / raw)
To: snitzer, axboe; +Cc: joseph.qi, caspar, hch, linux-block, dm-devel, io-uring
There's no sense in waiting for the hw queue when it is currently locked
by another polling instance. The polling instance currently occupying
the hw queue will help reap the completion events.
It shall be safe to surrender the hw queue, as long as we can reapply
for polling later. For synchronous polling, blk_poll() will reapply for
polling, since @spin is always true in this case. For asynchronous
polling, io_uring itself will reapply for polling when the previous
polling returns 0.
Besides, this shall do no harm to the polling performance of mq devices.
Signed-off-by: Jeffle Xu <[email protected]>
---
drivers/nvme/host/pci.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 81e6389b2042..db164fcd04e2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1106,7 +1106,9 @@ static int nvme_poll(struct blk_mq_hw_ctx *hctx)
if (!nvme_cqe_pending(nvmeq))
return 0;
- spin_lock(&nvmeq->cq_poll_lock);
+ if (!spin_trylock(&nvmeq->cq_poll_lock))
+ return 0;
+
found = nvme_process_cq(nvmeq);
spin_unlock(&nvmeq->cq_poll_lock);
--
2.27.0
* [PATCH v3 11/11] dm: fastpath of bio-based polling
From: Jeffle Xu @ 2021-02-08 8:52 UTC (permalink / raw)
To: snitzer, axboe; +Cc: joseph.qi, caspar, hch, linux-block, dm-devel, io-uring
Offer a fastpath for bio-based polling when the bio submitted to the dm
device is not split.
In this case, there will be only one bio submitted to only one polling
hw queue of one underlying mq device, and thus we don't need to track
all split bios or iterate through all polling hw queues. The pointer to
the polling hw queue the bio was submitted to is returned as the cookie.
The polling routine will then call mq_ops->poll() directly with the hw
queue converted from the input cookie.
If the original bio submitted to the dm device is split into multiple
bios and thus submitted to multiple polling hw queues, the bio
submission routine will return BLK_QC_T_BIO_MULTI, and the polling
routine will fall back to iterating all hw queues (in polling mode) of
all underlying mq devices.
Signed-off-by: Jeffle Xu <[email protected]>
---
block/blk-core.c | 33 +++++++++++++++++++++++++++++++--
include/linux/blk_types.h | 8 ++++++++
include/linux/types.h | 2 +-
3 files changed, 40 insertions(+), 3 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 37aa513da5f2..cb24b33a4870 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -956,11 +956,19 @@ static blk_qc_t __submit_bio(struct bio *bio)
* bio_list_on_stack[0] contains bios submitted by the current ->submit_bio.
* bio_list_on_stack[1] contains bios that were submitted before the current
* ->submit_bio_bio, but that haven't been processed yet.
+ *
+ * Return:
+ * - BLK_QC_T_NONE, no need for IO polling.
+ * - BLK_QC_T_BIO_MULTI, @bio gets split and enqueued into multi hw queues.
+ * - Otherwise, @bio is not split, returning the pointer to the corresponding
+ * hw queue that the bio enqueued into as the returned cookie.
*/
static blk_qc_t __submit_bio_noacct(struct bio *bio)
{
struct bio_list bio_list_on_stack[2];
blk_qc_t ret = BLK_QC_T_NONE;
+ struct request_queue *top_q = bio->bi_disk->queue;
+ bool poll_on = test_bit(QUEUE_FLAG_POLL, &top_q->queue_flags);
BUG_ON(bio->bi_next);
@@ -968,6 +976,7 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
current->bio_list = bio_list_on_stack;
do {
+ blk_qc_t cookie;
struct request_queue *q = bio->bi_disk->queue;
struct bio_list lower, same;
@@ -980,7 +989,20 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
bio_list_on_stack[1] = bio_list_on_stack[0];
bio_list_init(&bio_list_on_stack[0]);
- ret = __submit_bio(bio);
+ cookie = __submit_bio(bio);
+
+ if (poll_on &&
+ blk_qc_t_bio_valid(ret) && blk_qc_t_valid(cookie)) {
+ unsigned int queue_num = blk_qc_t_to_queue_num(cookie);
+ struct blk_mq_hw_ctx *hctx = q->queue_hw_ctx[queue_num];
+
+ cookie = (blk_qc_t)hctx;
+
+ if (!blk_qc_t_valid(ret)) /* set initial value */
+ ret = cookie;
+ else if (ret != cookie) /* bio got split */
+ ret = BLK_QC_T_BIO_MULTI;
+ }
/*
* Sort new bios into those for a lower level and those for the
@@ -1003,6 +1025,7 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
} while ((bio = bio_list_pop(&bio_list_on_stack[0])));
current->bio_list = NULL;
+
return ret;
}
@@ -1142,7 +1165,13 @@ static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
do {
int ret;
- ret = disk->fops->poll(q, cookie);
+ if (blk_qc_t_bio_valid(cookie)) {
+ struct blk_mq_hw_ctx *hctx = (struct blk_mq_hw_ctx *)cookie;
+ struct request_queue *target_q = hctx->queue;
+
+ ret = blk_mq_poll_hctx(target_q, hctx);
+ } else
+ ret = disk->fops->poll(q, cookie);
if (ret > 0) {
__set_current_state(TASK_RUNNING);
return ret;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 2e05244fc16d..4173754532c0 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -557,6 +557,14 @@ static inline bool blk_qc_t_is_internal(blk_qc_t cookie)
return (cookie & BLK_QC_T_INTERNAL) != 0;
}
+/* Macros for blk_qc_t used for bio-based polling */
+#define BLK_QC_T_BIO_MULTI -2U
+
+static inline bool blk_qc_t_bio_valid(blk_qc_t cookie)
+{
+ return cookie != BLK_QC_T_BIO_MULTI;
+}
+
struct blk_rq_stat {
u64 mean;
u64 min;
diff --git a/include/linux/types.h b/include/linux/types.h
index da5ca7e1bea9..f6301014a459 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -126,7 +126,7 @@ typedef u64 sector_t;
typedef u64 blkcnt_t;
/* cookie used for IO polling */
-typedef unsigned int blk_qc_t;
+typedef uintptr_t blk_qc_t;
/*
* The type of an index into the pagecache.
--
2.27.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH v3 09/11] dm: support IO polling for bio-based dm device
2021-02-08 8:52 ` [PATCH v3 09/11] dm: support IO polling for bio-based dm device Jeffle Xu
@ 2021-02-09 3:11 ` Ming Lei
2021-02-09 6:13 ` JeffleXu
2021-02-19 14:17 ` [dm-devel] " Mikulas Patocka
1 sibling, 1 reply; 27+ messages in thread
From: Ming Lei @ 2021-02-09 3:11 UTC (permalink / raw)
To: Jeffle Xu
Cc: snitzer, axboe, joseph.qi, caspar, hch, linux-block, dm-devel,
io-uring
On Mon, Feb 08, 2021 at 04:52:41PM +0800, Jeffle Xu wrote:
> DM will iterate and poll all polling hardware queues of all target mq
> devices when polling IO for dm device. To mitigate the race introduced
> by iterating all target hw queues, a per-hw-queue flag is maintained
What is the per-hw-queue flag?
> to indicate whether this polling hw queue currently being polled on or
> not. Every polling hw queue is exclusive to one polling instance, i.e.,
> the polling instance will skip this polling hw queue if this hw queue
> currently is being polled by another polling instance, and start
> polling on the next hw queue.
Not see such skip in dm_poll_one_dev() in which
queue_for_each_poll_hw_ctx() is called directly for polling all POLL
hctxs of the request queue, so can you explain it a bit more about this
skip mechanism?
Even though such skipping is implemented, not sure if good performance
can be reached because hctx poll may be done in ping-pong style
among several CPUs. But blk-mq hctx is supposed to have its cpu affinities.
--
Ming
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v3 09/11] dm: support IO polling for bio-based dm device
2021-02-09 3:11 ` Ming Lei
@ 2021-02-09 6:13 ` JeffleXu
2021-02-09 6:22 ` JeffleXu
2021-02-09 8:07 ` Ming Lei
0 siblings, 2 replies; 27+ messages in thread
From: JeffleXu @ 2021-02-09 6:13 UTC (permalink / raw)
To: Ming Lei
Cc: snitzer, axboe, joseph.qi, caspar, hch, linux-block, dm-devel,
io-uring
On 2/9/21 11:11 AM, Ming Lei wrote:
> On Mon, Feb 08, 2021 at 04:52:41PM +0800, Jeffle Xu wrote:
>> DM will iterate and poll all polling hardware queues of all target mq
>> devices when polling IO for dm device. To mitigate the race introduced
>> by iterating all target hw queues, a per-hw-queue flag is maintained
>
> What is the per-hw-queue flag?
Sorry I forgot to update the commit message as the implementation
changed. Actually this mechanism is implemented by patch 10 of this
patch set.
>
>> to indicate whether this polling hw queue currently being polled on or
>> not. Every polling hw queue is exclusive to one polling instance, i.e.,
>> the polling instance will skip this polling hw queue if this hw queue
>> currently is being polled by another polling instance, and start
>> polling on the next hw queue.
>
> Not see such skip in dm_poll_one_dev() in which
> queue_for_each_poll_hw_ctx() is called directly for polling all POLL
> hctxs of the request queue, so can you explain it a bit more about this
> skip mechanism?
>
It is implemented as patch 10 of this patch set. When spin_trylock()
fails, the polling instance will return immediately, instead of busy
waiting.
> Even though such skipping is implemented, not sure if good performance
> can be reached because hctx poll may be done in ping-pong style
> among several CPUs. But blk-mq hctx is supposed to have its cpu affinities.
>
Yes, the mechanism of iterating all hw queues can make the competition
worse.
If every underlying data device has **only** one polling hw queue, then
this ping-pong style polling still exist, even when we implement split
bio tracking mechanism, i.e., acquiring the specific hw queue the bio
enqueued into. Because multiple polling instance has to compete for the
only polling hw queue.
But if multiple polling hw queues per device are reserved for multiple
polling instances, (e.g., every underlying data device has 3 polling hw
queues when there are 3 polling instances), just as what we practice on
mq polling, then the current implementation of iterating all hw queues
will indeed works in a ping-pong style, while this issue shall not exist
when accurate split bio tracking mechanism could be implemented.
As for the performance, I cite the test results here, as summarized in
the cover-letter
(https://lore.kernel.org/io-uring/[email protected]/)
| IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
----------- | --------------- | -------------------- | ----
without opt | 318k | 256k | ~-20%
with opt | 314k | 354k | ~13%
The 'opt' refers to the optimization of patch 10, i.e., the skipping
mechanism. There are 3 polling instances (i.e., 3 CPUs) in this test case.
Indeed the current implementation of iterating all hw queues is some
sort of compromise, as I find it really difficult to implement an
accurate split bio tracking mechanism and achieve high performance at
the same time. Thus I turned to optimizing the original implementation
of iterating all hw queues, i.e., the optimizations in patches 10 and 11.
--
Thanks,
Jeffle
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v3 09/11] dm: support IO polling for bio-based dm device
2021-02-09 6:13 ` JeffleXu
@ 2021-02-09 6:22 ` JeffleXu
2021-02-09 8:07 ` Ming Lei
1 sibling, 0 replies; 27+ messages in thread
From: JeffleXu @ 2021-02-09 6:22 UTC (permalink / raw)
To: Ming Lei
Cc: snitzer, axboe, joseph.qi, caspar, hch, linux-block, dm-devel,
io-uring
On 2/9/21 2:13 PM, JeffleXu wrote:
>
>
> On 2/9/21 11:11 AM, Ming Lei wrote:
>> On Mon, Feb 08, 2021 at 04:52:41PM +0800, Jeffle Xu wrote:
>>> DM will iterate and poll all polling hardware queues of all target mq
>>> devices when polling IO for dm device. To mitigate the race introduced
>>> by iterating all target hw queues, a per-hw-queue flag is maintained
>>
>> What is the per-hw-queue flag?
>
> Sorry I forgot to update the commit message as the implementation
> changed. Actually this mechanism is implemented by patch 10 of this
> patch set.
>
>>
>>> to indicate whether this polling hw queue currently being polled on or
>>> not. Every polling hw queue is exclusive to one polling instance, i.e.,
>>> the polling instance will skip this polling hw queue if this hw queue
>>> currently is being polled by another polling instance, and start
>>> polling on the next hw queue.
>>
>> Not see such skip in dm_poll_one_dev() in which
>> queue_for_each_poll_hw_ctx() is called directly for polling all POLL
>> hctxs of the request queue, so can you explain it a bit more about this
>> skip mechanism?
>>
>
> It is implemented as patch 10 of this patch set. When spin_trylock()
> fails, the polling instance will return immediately, instead of busy
> waiting.
>
>
>> Even though such skipping is implemented, not sure if good performance
>> can be reached because hctx poll may be done in ping-pong style
>> among several CPUs. But blk-mq hctx is supposed to have its cpu affinities.
>>
>
> Yes, the mechanism of iterating all hw queues can make the competition
> worse.
>
> If every underlying data device has **only** one polling hw queue, then
> this ping-pong style polling still exist, even when we implement split
> bio tracking mechanism, i.e., acquiring the specific hw queue the bio
> enqueued into. Because multiple polling instance has to compete for the
> only polling hw queue.
>
> But if multiple polling hw queues per device are reserved for multiple
> polling instances, (e.g., every underlying data device has 3 polling hw
> queues when there are 3 polling instances), just as what we practice on
> mq polling, then the current implementation of iterating all hw queues
> will indeed works in a ping-pong style, while this issue shall not exist
> when accurate split bio tracking mechanism could be implemented.
>
Setting aside process migration, I could avoid iterating all hw queues
while staying within the framework of the current implementation. For
example, the CPU number of the IO submitting process could be stored in
the cookie, and the polling routine would then only iterate the hw
queues to which the stored CPU number maps. Just a temporary insight
though, roughly sketched below ....
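
A rough sketch of that idea, reusing the helpers this series adds
(queue_for_each_poll_hw_ctx() from patch 6 and blk_mq_poll_hctx() from
patch 5); the cookie encoding and the function name are assumptions:

```
/*
 * Illustrative only, not part of the posted patches.  Assume the
 * submitting CPU was stored in the cookie at submission time, e.g.
 * cookie = (blk_qc_t)raw_smp_processor_id(), and only poll the hw
 * queues (in polling mode) that this CPU maps to.
 */
static int dm_poll_hctxs_for_cpu(struct request_queue *q, unsigned int cpu)
{
	struct blk_mq_hw_ctx *hctx;
	int i, found = 0;

	queue_for_each_poll_hw_ctx(q, hctx, i) {
		/* skip poll hw queues not mapped to the submitting CPU */
		if (!cpumask_test_cpu(cpu, hctx->cpumask))
			continue;
		found += blk_mq_poll_hctx(q, hctx);
	}

	return found;
}
```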
--
Thanks,
Jeffle
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v3 09/11] dm: support IO polling for bio-based dm device
2021-02-09 6:13 ` JeffleXu
2021-02-09 6:22 ` JeffleXu
@ 2021-02-09 8:07 ` Ming Lei
2021-02-09 8:46 ` JeffleXu
1 sibling, 1 reply; 27+ messages in thread
From: Ming Lei @ 2021-02-09 8:07 UTC (permalink / raw)
To: JeffleXu
Cc: snitzer, axboe, joseph.qi, caspar, hch, linux-block, dm-devel,
io-uring
On Tue, Feb 09, 2021 at 02:13:38PM +0800, JeffleXu wrote:
>
>
> On 2/9/21 11:11 AM, Ming Lei wrote:
> > On Mon, Feb 08, 2021 at 04:52:41PM +0800, Jeffle Xu wrote:
> >> DM will iterate and poll all polling hardware queues of all target mq
> >> devices when polling IO for dm device. To mitigate the race introduced
> >> by iterating all target hw queues, a per-hw-queue flag is maintained
> >
> > What is the per-hw-queue flag?
>
> Sorry I forgot to update the commit message as the implementation
> changed. Actually this mechanism is implemented by patch 10 of this
> patch set.
It is hard to associate patch 10's spin_trylock() with per-hw-queue
flag. Also scsi's poll implementation is in-progress, and scsi's poll may
not be implemented in this way.
>
> >
> >> to indicate whether this polling hw queue currently being polled on or
> >> not. Every polling hw queue is exclusive to one polling instance, i.e.,
> >> the polling instance will skip this polling hw queue if this hw queue
> >> currently is being polled by another polling instance, and start
> >> polling on the next hw queue.
> >
> > Not see such skip in dm_poll_one_dev() in which
> > queue_for_each_poll_hw_ctx() is called directly for polling all POLL
> > hctxs of the request queue, so can you explain it a bit more about this
> > skip mechanism?
> >
>
> It is implemented as patch 10 of this patch set. When spin_trylock()
> fails, the polling instance will return immediately, instead of busy
> waiting.
>
>
> > Even though such skipping is implemented, not sure if good performance
> > can be reached because hctx poll may be done in ping-pong style
> > among several CPUs. But blk-mq hctx is supposed to have its cpu affinities.
> >
>
> Yes, the mechanism of iterating all hw queues can make the competition
> worse.
>
> If every underlying data device has **only** one polling hw queue, then
> this ping-pong style polling still exist, even when we implement split
> bio tracking mechanism, i.e., acquiring the specific hw queue the bio
> enqueued into. Because multiple polling instance has to compete for the
> only polling hw queue.
>
> But if multiple polling hw queues per device are reserved for multiple
> polling instances, (e.g., every underlying data device has 3 polling hw
> queues when there are 3 polling instances), just as what we practice on
> mq polling, then the current implementation of iterating all hw queues
> will indeed works in a ping-pong style, while this issue shall not exist
> when accurate split bio tracking mechanism could be implemented.
In reality it could be possible to have one hw queue for each numa node.
And you may re-use blk_mq_map_queue() for getting the proper hw queue for poll.
--
Ming
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v3 09/11] dm: support IO polling for bio-based dm device
2021-02-09 8:07 ` Ming Lei
@ 2021-02-09 8:46 ` JeffleXu
0 siblings, 0 replies; 27+ messages in thread
From: JeffleXu @ 2021-02-09 8:46 UTC (permalink / raw)
To: Ming Lei
Cc: snitzer, axboe, joseph.qi, caspar, hch, linux-block, dm-devel,
io-uring
On 2/9/21 4:07 PM, Ming Lei wrote:
> On Tue, Feb 09, 2021 at 02:13:38PM +0800, JeffleXu wrote:
>>
>>
>> On 2/9/21 11:11 AM, Ming Lei wrote:
>>> On Mon, Feb 08, 2021 at 04:52:41PM +0800, Jeffle Xu wrote:
>>>> DM will iterate and poll all polling hardware queues of all target mq
>>>> devices when polling IO for dm device. To mitigate the race introduced
>>>> by iterating all target hw queues, a per-hw-queue flag is maintained
>>>
>>> What is the per-hw-queue flag?
>>
>> Sorry I forgot to update the commit message as the implementation
>> changed. Actually this mechanism is implemented by patch 10 of this
>> patch set.
>
> It is hard to associate patch 10's spin_trylock() with per-hw-queue
> flag.
You're right, the commit message here is a mistake. Actually I had
previously implemented a per-hw-queue flag, such as
```
struct blk_mq_hw_ctx {
	atomic_t busy;
	...
};
```
In that version, the skipping mechanism was implemented in the block
layer. But later I refactored the code and moved the implementation to
the device driver layer, as described by patch 10, while forgetting to
update the commit message. The reason why I implemented it in the
device driver layer is that the competition actually stems from the
underlying device driver (e.g., the nvme driver), as shown in the
following snippet.
```
nvme_poll
	spin_lock(&nvmeq->cq_poll_lock);
	found = nvme_process_cq(nvmeq);
	spin_unlock(&nvmeq->cq_poll_lock);
```
It is @nvmeq->cq_poll_lock, i.e., the implementation of the underlying
device driver that has caused the competition. Thus maybe it is
reasonable to handle the competition issue in the device driver layer?
> Also scsi's poll implementation is in-progress, and scsi's poll may
> not be implemented in this way.
Yes. The defect of leaving the competition issue to the device driver
layer is that every device driver supporting polling needs to be
optimized individually in some way. Actually I have not taken a close
look at the other two nvme transports (drivers/nvme/host/tcp.c and
drivers/nvme/host/rdma.c), which also support polling.
>>
>>>
>>>> to indicate whether this polling hw queue currently being polled on or
>>>> not. Every polling hw queue is exclusive to one polling instance, i.e.,
>>>> the polling instance will skip this polling hw queue if this hw queue
>>>> currently is being polled by another polling instance, and start
>>>> polling on the next hw queue.
>>>
>>> Not see such skip in dm_poll_one_dev() in which
>>> queue_for_each_poll_hw_ctx() is called directly for polling all POLL
>>> hctxs of the request queue, so can you explain it a bit more about this
>>> skip mechanism?
>>>
>>
>> It is implemented as patch 10 of this patch set. When spin_trylock()
>> fails, the polling instance will return immediately, instead of busy
>> waiting.
>>
>>
>>> Even though such skipping is implemented, not sure if good performance
>>> can be reached because hctx poll may be done in ping-pong style
>>> among several CPUs. But blk-mq hctx is supposed to have its cpu affinities.
>>>
>>
>> Yes, the mechanism of iterating all hw queues can make the competition
>> worse.
>>
>> If every underlying data device has **only** one polling hw queue, then
>> this ping-pong style polling still exist, even when we implement split
>> bio tracking mechanism, i.e., acquiring the specific hw queue the bio
>> enqueued into. Because multiple polling instance has to compete for the
>> only polling hw queue.
>>
>> But if multiple polling hw queues per device are reserved for multiple
>> polling instances, (e.g., every underlying data device has 3 polling hw
>> queues when there are 3 polling instances), just as what we practice on
>> mq polling, then the current implementation of iterating all hw queues
>> will indeed works in a ping-pong style, while this issue shall not exist
>> when accurate split bio tracking mechanism could be implemented.
>
> In reality it could be possible to have one hw queue for each numa node.
>
> And you may re-use blk_mq_map_queue() for getting the proper hw queue for poll.
Thanks. But the optimization I proposed in [1] may not work well when
the IO submitting process migrates to another CPU halfway through. I
mean, the process has submitted several split bios, and then it
migrates to another CPU and continues submitting the remaining split
bios.
[1]
https://lore.kernel.org/io-uring/[email protected]/T/#m0d9a0e55e11874a70c6a3491f191289df72a84f8
--
Thanks,
Jeffle
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v3 00/11] dm: support IO polling
2021-02-08 8:52 [PATCH v3 00/11] dm: support IO polling Jeffle Xu
` (10 preceding siblings ...)
2021-02-08 8:52 ` [PATCH v3 11/11] dm: fastpath of bio-based polling Jeffle Xu
@ 2021-02-17 13:15 ` JeffleXu
2021-03-10 20:01 ` Mike Snitzer
12 siblings, 0 replies; 27+ messages in thread
From: JeffleXu @ 2021-02-17 13:15 UTC (permalink / raw)
To: snitzer, axboe; +Cc: joseph.qi, caspar, hch, linux-block, dm-devel, io-uring
Hi,
Any comment on this series?
Thanks,
Jeffle
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [dm-devel] [PATCH v3 09/11] dm: support IO polling for bio-based dm device
2021-02-08 8:52 ` [PATCH v3 09/11] dm: support IO polling for bio-based dm device Jeffle Xu
2021-02-09 3:11 ` Ming Lei
@ 2021-02-19 14:17 ` Mikulas Patocka
2021-02-24 1:42 ` JeffleXu
2021-02-26 8:22 ` JeffleXu
1 sibling, 2 replies; 27+ messages in thread
From: Mikulas Patocka @ 2021-02-19 14:17 UTC (permalink / raw)
To: Jeffle Xu
Cc: snitzer, axboe, caspar, hch, linux-block, joseph.qi, dm-devel,
io-uring
On Mon, 8 Feb 2021, Jeffle Xu wrote:
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index c2945c90745e..8423f1347bb8 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1657,6 +1657,68 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
> return BLK_QC_T_NONE;
> }
>
> +static int dm_poll_one_md(struct mapped_device *md);
> +
> +static int dm_poll_one_dev(struct dm_target *ti, struct dm_dev *dev,
> + sector_t start, sector_t len, void *data)
> +{
> + int i, *count = data;
> + struct request_queue *q = bdev_get_queue(dev->bdev);
> + struct blk_mq_hw_ctx *hctx;
> +
> + if (queue_is_mq(q)) {
> + if (!percpu_ref_tryget(&q->q_usage_counter))
> + return 0;
> +
> + queue_for_each_poll_hw_ctx(q, hctx, i)
> + *count += blk_mq_poll_hctx(q, hctx);
> +
> + percpu_ref_put(&q->q_usage_counter);
> + } else
> + *count += dm_poll_one_md(dev->bdev->bd_disk->private_data);
This is fragile, because in the future there may be other bio-based
drivers that support polling. You should check that "q" is really a device
mapper device before calling dm_poll_one_md on it.
Mikulas
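
A minimal sketch of such a check, for illustration only; dm_blk_dops is
the block_device_operations defined in drivers/md/dm.c, so a helper like
this would have to live there:

```
/*
 * Illustrative sketch: only treat the device as a dm device if its
 * gendisk really uses dm's block_device_operations (dm_blk_dops is
 * static in drivers/md/dm.c, so this helper belongs there).
 */
static bool dm_is_dm_bdev(struct block_device *bdev)
{
	return bdev->bd_disk->fops == &dm_blk_dops;
}

/*
 * ... and the else branch of dm_poll_one_dev() would then become:
 *
 *	} else if (dm_is_dm_bdev(dev->bdev))
 *		*count += dm_poll_one_md(dev->bdev->bd_disk->private_data);
 */
```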
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [dm-devel] [PATCH v3 11/11] dm: fastpath of bio-based polling
2021-02-08 8:52 ` [PATCH v3 11/11] dm: fastpath of bio-based polling Jeffle Xu
@ 2021-02-19 19:38 ` Mikulas Patocka
2021-02-26 8:12 ` JeffleXu
0 siblings, 1 reply; 27+ messages in thread
From: Mikulas Patocka @ 2021-02-19 19:38 UTC (permalink / raw)
To: Jeffle Xu
Cc: snitzer, axboe, caspar, hch, linux-block, joseph.qi, dm-devel,
io-uring
On Mon, 8 Feb 2021, Jeffle Xu wrote:
> Offer one fastpath of bio-based polling when bio submitted to dm device
> is not split.
>
> In this case, there will be only one bio submitted to only one polling
> hw queue of one underlying mq device, and thus we don't need to track
> all split bios or iterate through all polling hw queues. The pointer to
> the polling hw queue the bio submitted to is returned here as the
> returned cookie.
This doesn't seem safe - note that between submit_bio() and blk_poll(), no
locks are held - so the device mapper device may be reconfigured
arbitrarily. When you call blk_poll() with a pointer returned by
submit_bio(), the pointer may point to a stale address.
Mikulas
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [dm-devel] [PATCH v3 09/11] dm: support IO polling for bio-based dm device
2021-02-19 14:17 ` [dm-devel] " Mikulas Patocka
@ 2021-02-24 1:42 ` JeffleXu
2021-02-26 8:22 ` JeffleXu
1 sibling, 0 replies; 27+ messages in thread
From: JeffleXu @ 2021-02-24 1:42 UTC (permalink / raw)
To: Mikulas Patocka
Cc: axboe, snitzer, caspar, io-uring, linux-block, joseph.qi,
dm-devel, hch
On 2/19/21 10:17 PM, Mikulas Patocka wrote:
>
>
> On Mon, 8 Feb 2021, Jeffle Xu wrote:
>
>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>> index c2945c90745e..8423f1347bb8 100644
>> --- a/drivers/md/dm.c
>> +++ b/drivers/md/dm.c
>> @@ -1657,6 +1657,68 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
>> return BLK_QC_T_NONE;
>> }
>>
>> +static int dm_poll_one_md(struct mapped_device *md);
>> +
>> +static int dm_poll_one_dev(struct dm_target *ti, struct dm_dev *dev,
>> + sector_t start, sector_t len, void *data)
>> +{
>> + int i, *count = data;
>> + struct request_queue *q = bdev_get_queue(dev->bdev);
>> + struct blk_mq_hw_ctx *hctx;
>> +
>> + if (queue_is_mq(q)) {
>> + if (!percpu_ref_tryget(&q->q_usage_counter))
>> + return 0;
>> +
>> + queue_for_each_poll_hw_ctx(q, hctx, i)
>> + *count += blk_mq_poll_hctx(q, hctx);
>> +
>> + percpu_ref_put(&q->q_usage_counter);
>> + } else
>> + *count += dm_poll_one_md(dev->bdev->bd_disk->private_data);
>
> This is fragile, because in the future there may be other bio-based
> drivers that support polling. You should check that "q" is really a device
> mapper device before calling dm_poll_one_md on it.
>
Sorry I missed this reply. Your advice matters, thanks.
--
Thanks,
Jeffle
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [dm-devel] [PATCH v3 11/11] dm: fastpath of bio-based polling
2021-02-19 19:38 ` [dm-devel] " Mikulas Patocka
@ 2021-02-26 8:12 ` JeffleXu
2021-03-02 19:03 ` Mikulas Patocka
0 siblings, 1 reply; 27+ messages in thread
From: JeffleXu @ 2021-02-26 8:12 UTC (permalink / raw)
To: Mikulas Patocka
Cc: axboe, snitzer, caspar, io-uring, linux-block, joseph.qi,
dm-devel, hch
On 2/20/21 3:38 AM, Mikulas Patocka wrote:
>
>
> On Mon, 8 Feb 2021, Jeffle Xu wrote:
>
>> Offer one fastpath of bio-based polling when bio submitted to dm device
>> is not split.
>>
>> In this case, there will be only one bio submitted to only one polling
>> hw queue of one underlying mq device, and thus we don't need to track
>> all split bios or iterate through all polling hw queues. The pointer to
>> the polling hw queue the bio submitted to is returned here as the
>> returned cookie.
>
> This doesn't seem safe - note that between submit_bio() and blk_poll(), no
> locks are held - so the device mapper device may be reconfigured
> arbitrarily. When you call blk_poll() with a pointer returned by
> submit_bio(), the pointer may point to a stale address.
>
Thanks for the feedback. Indeed maybe it's not a good idea to directly
return a 'struct blk_mq_hw_ctx *' pointer as the returned cookie.
Currently I have no idea to fix it, orz... The
blk_get_queue()/blk_put_queue() tricks may not work in this case.
Because the returned cookie may not be used at all. Before calling
blk_poll(), the polling routine may find that the corresponding IO has
already completed, and thus won't call blk_poll(), in which case we have
no place to put the refcount.
But I really don't want to drop this optimization, since this
optimization is quite intuitive when dm device maps to a lot of
underlying devices. Though this optimization doesn't actually achieve
reasonable performance gain in my test, maybe because there are at most
seven nvme devices in my test machine.
Any thoughts?
Thanks,
Jeffle
--
Thanks,
Jeffle
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [dm-devel] [PATCH v3 09/11] dm: support IO polling for bio-based dm device
2021-02-19 14:17 ` [dm-devel] " Mikulas Patocka
2021-02-24 1:42 ` JeffleXu
@ 2021-02-26 8:22 ` JeffleXu
1 sibling, 0 replies; 27+ messages in thread
From: JeffleXu @ 2021-02-26 8:22 UTC (permalink / raw)
To: Mikulas Patocka
Cc: axboe, snitzer, caspar, io-uring, linux-block, joseph.qi,
dm-devel, hch
On 2/19/21 10:17 PM, Mikulas Patocka wrote:
>
>
> On Mon, 8 Feb 2021, Jeffle Xu wrote:
>
>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>> index c2945c90745e..8423f1347bb8 100644
>> --- a/drivers/md/dm.c
>> +++ b/drivers/md/dm.c
>> @@ -1657,6 +1657,68 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
>> return BLK_QC_T_NONE;
>> }
>>
>> +static int dm_poll_one_md(struct mapped_device *md);
>> +
>> +static int dm_poll_one_dev(struct dm_target *ti, struct dm_dev *dev,
>> + sector_t start, sector_t len, void *data)
>> +{
>> + int i, *count = data;
>> + struct request_queue *q = bdev_get_queue(dev->bdev);
>> + struct blk_mq_hw_ctx *hctx;
>> +
>> + if (queue_is_mq(q)) {
>> + if (!percpu_ref_tryget(&q->q_usage_counter))
>> + return 0;
>> +
>> + queue_for_each_poll_hw_ctx(q, hctx, i)
>> + *count += blk_mq_poll_hctx(q, hctx);
>> +
>> + percpu_ref_put(&q->q_usage_counter);
>> + } else
>> + *count += dm_poll_one_md(dev->bdev->bd_disk->private_data);
>
> This is fragile, because in the future there may be other bio-based
> drivers that support polling. You should check that "q" is really a device
> mapper device before calling dm_poll_one_md on it.
>
This can be easily fixed by calling disk->fops->poll() recursively, such as
```
if (queue_is_mq(q)) {
	...
} else {
	/* disk = gendisk of the underlying device */
	pdata->count += disk->fops->poll(q, pdata->cookie);
}
```
But meanwhile I realize that this calls disk->fops->poll() recursively,
and thus may cause stack overflow, similar to submit_bio(), when the
device stack is deep enough.
Maybe a control structure like bio_list is needed here to flatten the
recursion.
Any thoughts?
--
Thanks,
Jeffle
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [dm-devel] [PATCH v3 11/11] dm: fastpath of bio-based polling
2021-02-26 8:12 ` JeffleXu
@ 2021-03-02 19:03 ` Mikulas Patocka
2021-03-03 1:55 ` JeffleXu
0 siblings, 1 reply; 27+ messages in thread
From: Mikulas Patocka @ 2021-03-02 19:03 UTC (permalink / raw)
To: JeffleXu
Cc: axboe, snitzer, caspar, io-uring, linux-block, joseph.qi,
dm-devel, hch
On Fri, 26 Feb 2021, JeffleXu wrote:
>
>
> On 2/20/21 3:38 AM, Mikulas Patocka wrote:
> >
> >
> > On Mon, 8 Feb 2021, Jeffle Xu wrote:
> >
> >> Offer one fastpath of bio-based polling when bio submitted to dm device
> >> is not split.
> >>
> >> In this case, there will be only one bio submitted to only one polling
> >> hw queue of one underlying mq device, and thus we don't need to track
> >> all split bios or iterate through all polling hw queues. The pointer to
> >> the polling hw queue the bio submitted to is returned here as the
> >> returned cookie.
> >
> > This doesn't seem safe - note that between submit_bio() and blk_poll(), no
> > locks are held - so the device mapper device may be reconfigured
> > arbitrarily. When you call blk_poll() with a pointer returned by
> > submit_bio(), the pointer may point to a stale address.
> >
>
> Thanks for the feedback. Indeed maybe it's not a good idea to directly
> return a 'struct blk_mq_hw_ctx *' pointer as the returned cookie.
>
> Currently I have no idea to fix it, orz... The
> blk_get_queue()/blk_put_queue() tricks may not work in this case.
> Because the returned cookie may not be used at all. Before calling
> blk_poll(), the polling routine may find that the corresponding IO has
> already completed, and thus won't call blk_poll(), in which case we have
> no place to put the refcount.
>
> But I really don't want to drop this optimization, since this
> optimization is quite intuitive when dm device maps to a lot of
> underlying devices. Though this optimization doesn't actually achieve
> reasonable performance gain in my test, maybe because there are at most
> seven nvme devices in my test machine.
>
> Any thoughts?
>
> Thanks,
> Jeffle
Hi
I reworked device mapper polling, so that we poll in the function
__split_and_process_bio. The pointer to a queue and the polling cookie is
passed only inside device mapper code, it never leaves it.
I'll send you my patches - try them and tell me how does it perform
compared to your patchset.
Mikulas
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [dm-devel] [PATCH v3 11/11] dm: fastpath of bio-based polling
2021-03-02 19:03 ` Mikulas Patocka
@ 2021-03-03 1:55 ` JeffleXu
0 siblings, 0 replies; 27+ messages in thread
From: JeffleXu @ 2021-03-03 1:55 UTC (permalink / raw)
To: Mikulas Patocka
Cc: axboe, snitzer, caspar, io-uring, linux-block, joseph.qi,
dm-devel, hch
On 3/3/21 3:03 AM, Mikulas Patocka wrote:
>
>
> On Fri, 26 Feb 2021, JeffleXu wrote:
>
>>
>>
>> On 2/20/21 3:38 AM, Mikulas Patocka wrote:
>>>
>>>
>>> On Mon, 8 Feb 2021, Jeffle Xu wrote:
>>>
>>>> Offer one fastpath of bio-based polling when bio submitted to dm device
>>>> is not split.
>>>>
>>>> In this case, there will be only one bio submitted to only one polling
>>>> hw queue of one underlying mq device, and thus we don't need to track
>>>> all split bios or iterate through all polling hw queues. The pointer to
>>>> the polling hw queue the bio submitted to is returned here as the
>>>> returned cookie.
>>>
>>> This doesn't seem safe - note that between submit_bio() and blk_poll(), no
>>> locks are held - so the device mapper device may be reconfigured
>>> arbitrarily. When you call blk_poll() with a pointer returned by
>>> submit_bio(), the pointer may point to a stale address.
>>>
>>
>> Thanks for the feedback. Indeed maybe it's not a good idea to directly
>> return a 'struct blk_mq_hw_ctx *' pointer as the returned cookie.
>>
>> Currently I have no idea to fix it, orz... The
>> blk_get_queue()/blk_put_queue() tricks may not work in this case.
>> Because the returned cookie may not be used at all. Before calling
>> blk_poll(), the polling routine may find that the corresponding IO has
>> already completed, and thus won't call blk_poll(), in which case we have
>> no place to put the refcount.
>>
>> But I really don't want to drop this optimization, since this
>> optimization is quite intuitive when dm device maps to a lot of
>> underlying devices. Though this optimization doesn't actually achieve
>> reasonable performance gain in my test, maybe because there are at most
>> seven nvme devices in my test machine.
>>
>> Any thoughts?
>>
>> Thanks,
>> Jeffle
>
> Hi
>
> I reworked device mapper polling, so that we poll in the function
> __split_and_process_bio. The pointer to a queue and the polling cookie is
> passed only inside device mapper code, it never leaves it.
>
> I'll send you my patches - try them and tell me how does it perform
> compared to your patchset.
>
Thanks. Glad to hear that you're also working on this. I'd be happy to
give some comments on your patch set.
--
Thanks,
Jeffle
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v3 00/11] dm: support IO polling
2021-02-08 8:52 [PATCH v3 00/11] dm: support IO polling Jeffle Xu
` (11 preceding siblings ...)
2021-02-17 13:15 ` [PATCH v3 00/11] dm: support IO polling JeffleXu
@ 2021-03-10 20:01 ` Mike Snitzer
2021-03-11 7:07 ` [dm-devel] " JeffleXu
12 siblings, 1 reply; 27+ messages in thread
From: Mike Snitzer @ 2021-03-10 20:01 UTC (permalink / raw)
To: Jeffle Xu
Cc: axboe, joseph.qi, caspar, hch, linux-block, dm-devel, io-uring,
ejt
On Mon, Feb 08 2021 at 3:52am -0500,
Jeffle Xu <[email protected]> wrote:
>
> [Performance]
> 1. One thread (numjobs=1) randread (bs=4k, direct=1) one dm-linear
> device, which is built upon 3 nvme devices, with one polling hw
> queue per nvme device.
>
> | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
> ---- | --------------- | -------------------- | ----
> dm | 208k | 279k | ~34%
>
>
> 2. Three threads (numjobs=3) randread (bs=4k, direct=1) one dm-linear
> device, which is built upon 3 nvme devices, with one polling hw
> queue per nvme device.
>
> It's compared to 3 threads directly randread 3 nvme devices, with one
> polling hw queue per nvme device. No CPU affinity set for these 3
> threads. Thus every thread can access every nvme device
> (filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1), i.e., every thread
> need to competing for every polling hw queue.
>
> | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
> ---- | --------------- | -------------------- | ----
> dm | 615k | 728k | ~18%
> nvme | 728k | 873k | ~20%
>
> The result shows that the performance gain of bio-based polling is
> comparable as that of mq polling in the same test case.
>
>
> 3. Three threads (numjobs=3) randread (bs=12k, direct=1) one
> **dm-stripe** device, which is built upon 3 nvme devices, with one
> polling hw queue per nvme device.
>
> It's compared to 3 threads directly randread 3 nvme devices, with one
> polling hw queue per nvme device. No CPU affinity set for these 3
> threads. Thus every thread can access every nvme device
> (filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1), i.e., every thread
> need to competing for every polling hw queue.
>
> | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
> ---- | --------------- | -------------------- | ----
> dm | 314k | 354k | ~13%
> nvme | 728k | 873k | ~20%
>
So this "3." case is meant to illustrate the effects of polling when a
bio is split to 3 different underlying blk-mq devices. (I think it odd
that dm-stripe across 3 nvme devices is performing so much worse than a
single nvme device.)
Would be nice to see comparison of this same workload but _with_ CPU
affinity to show relative benefit of your patch 11 in this series (where
you try to leverage CPU pinning).
In general, I don't think patch 11 is a worthwhile effort. It is a
half-measure that is trying to paper over the fact that this bio-based
IO polling patchset is quite coarse grained about how it copes with the
need to poll multiple devices.
Patch 10 is a proper special case that should be accounted for (when a
bio isn't split and only gets submitted to a single blk-mq
device/hctx). But even patch 10's approach is fragile (as we've
discussed in private and I'll touch on in reply to patch 10).
But I think patch 11 should be dropped and we should defer optimizing
bio splitting to later follow-on work.
Just so you're aware, I'm now thinking the proper way forward is to
update DM core, at the time of DM table load, to assemble an array of
underlying _data_ devices in that table (as iterated with
.iterate_devices) -- this would allow each underlying data device to be
assigned an immutable index for the lifetime of a DM table. It'd be
hooked off the 'struct dm_table' and would share that object's
lifetime.
With that bit of infrastructure in place, we could then take steps to
make DM's cookie more dense in its description of underlying devices
that need polling. This is where I'll get a bit handwavvy.. but I
raised this idea with Joe Thornber and he is going to have a think about
it too.
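
Purely as an illustration of the shape this could take (all names below
are assumed; nothing like this exists in the posted series):

```
/*
 * Illustrative sketch only.  Each underlying data device gets an
 * immutable index for the lifetime of the dm_table, and the cookie is
 * a bitmap of the indices that actually received a split bio, so the
 * poll side only touches those devices.
 */
struct dm_poll_devs {
	unsigned int	count;			/* number of data devices */
	struct dm_dev	*dev[BITS_PER_LONG];	/* index -> data device   */
};

static int dm_poll_cookie(struct dm_poll_devs *pd, unsigned long cookie)
{
	unsigned int i;
	int found = 0;

	/* submission side would have done: cookie |= 1UL << dev_index; */
	for_each_set_bit(i, &cookie, pd->count)
		found += poll_one_data_dev(pd->dev[i]);	/* hypothetical helper */

	return found;
}
```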
But this is all to say: optimizing the complex case of bio-splitting
that is such an integral part of bio-based IO processing needs more than
what you've attempted to do (noble effort on your part but again, really
just a half-measure).
So I think it best to keep the initial implementation of bio-based
polling relatively simple by laying foundation for follow-on work. And
it is also important to _not_ encode in block core some meaning to what
_should_ be a largely opaque cookie (blk_qc_t) that is for the
underlying driver to make sense of.
>
> 4. This patchset shall do no harm to the performance of the original mq
> polling. Following is the test results of one thread (numjobs=1)
> randread (bs=4k, direct=1) one nvme device.
>
> | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
> ---------------- | --------------- | -------------------- | ----
> without patchset | 242k | 332k | ~39%
> with patchset | 236k | 332k | ~39%
OK, good, this needs to be the case.
>
> [Changes since v2]
>
> Patchset v2 caches all hw queues (in polling mode) of underlying mq
> devices in dm layer. The polling routine actually iterates through all
> these cached hw queues.
>
> However, mq may change the queue mapping at runtime (e.g., NVMe RESET
> command), thus the cached hw queues in dm layer may be out-of-date. Thus
> patchset v3 falls back to the implementation of the very first RFC
> version, in which the mq layer needs to export one interface iterating
> all polling hw queues (patch 5), and the bio-based polling routine just
> calls this interface to iterate all polling hw queues.
>
> Besides, several new optimization is proposed.
>
>
> - patch 1,2,7
> same as v2, untouched
>
> - patch 3
> Following advice from Christoph Hellwig, the blk_poll() refactoring
> splits the mq and bio-based polling routines from the very beginning. Now
> blk_poll() is just a simple entry point. blk_bio_poll() is simply copied from
> blk_mq_poll(), though the loop structure is some sort of duplication.
>
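(Likewise, purely as a sketch of the split entry point described above --
blk_mq_poll()/blk_bio_poll() are the names used by this series, but the body
below is only a guess at the shape, modelled on the mainline blk_poll()
guards of that time, not the actual patch:)

int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
{
	if (!blk_qc_t_valid(cookie) ||
	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
		return 0;

	/* Dispatch to the mq or the bio-based polling implementation. */
	if (queue_is_mq(q))
		return blk_mq_poll(q, cookie, spin);
	return blk_bio_poll(q, cookie, spin);
}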
> - patch 4
> This patch is newly added to support turning polling on/off through
> '/sys/block/<dev>/queue/io_poll' dynamically for bio-based devices.
> Patchset v2 implemented this functionality by adding one new queue flag,
> which is not preferred since queue flags are in quite short supply
> nowadays.
>
> - patch 5
> This patch is newly added, preparing for the following bio-based
> polling. The following bio-based polling will call this helper function,
> accounting on the corresponding hw queue.
>
> - patch 6
> It's from the very first RFC version, preparing for the following
> bio-based polling.
>
> - patch 8
> One fixing patch needed by the following bio-based polling. It's
> actually a v2 of [1]. I had sent the v2 singly in-reply-to [1], though
> it has not been visible on the mailing list maybe due to the delay.
>
> - patch 9
> It's from the very first RFC version.
>
> - patch 10
> This patch is newly added. Patchset v2 had proposed one
> optimization: skipping the **busy** hw queues during the iteration
> phase. Back then, one 'atomic_t' flag was specifically
> maintained in the dm layer, representing whether the corresponding hw queue
> is busy or not. The idea is inherited, while the implementation changes.
> Now @nvmeq->cq_poll_lock is used directly here; no extra flag is needed
> anymore.
>
> This optimization can significantly reduce the competition for one hw
> queue between multiple polling instances. The following statistics are the
> test results when 3 threads concurrently randread (bs=4k, direct=1) one
> dm-linear device, which is built upon 3 nvme devices, with one polling
> hw queue per nvme device.
>
> | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
> ----------- | --------------- | -------------------- | ----
> without opt | 318k | 256k | ~-20%
> with opt | 314k | 354k | ~13%
>
>
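(Again just for illustration: the busy-queue skip described above amounts to
replacing the unconditional lock in nvme_poll() with a trylock. Roughly the
shape below, based on the mainline nvme_poll() of that era; the actual patch
may differ in detail:)

static int nvme_poll(struct blk_mq_hw_ctx *hctx)
{
	struct nvme_queue *nvmeq = hctx->driver_data;
	bool found;

	if (!nvme_cqe_pending(nvmeq))
		return 0;

	/*
	 * If another context already holds cq_poll_lock, the queue is
	 * being polled anyway -- report "nothing found" and let the
	 * caller move on to the next hw queue instead of spinning here.
	 */
	if (!spin_trylock(&nvmeq->cq_poll_lock))
		return 0;

	found = nvme_process_cq(nvmeq);
	spin_unlock(&nvmeq->cq_poll_lock);

	return found;
}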
> - patch 11
> This is another newly added optimization for bio-based polling.
>
> One intuitive insight is that, when the original bio submitted to the dm
> device doesn't get split, the bio gets enqueued into only one hw
> queue of one of the underlying mq devices. In this case, we no longer
> need to track all split bios, and one cookie (for the only split bio)
> is enough. This is implemented by returning the pointer to the
> corresponding hw queue in this case.
>
> It should be safe to directly return the pointer to the hw queue,
> since a 'struct blk_mq_hw_ctx' won't be freed during the whole lifetime of
> the 'struct request_queue'. Even though the number of hw queues may decrease
> when an NVMe RESET happens, the 'struct blk_mq_hw_ctx' structures of the
> removed hw queues won't be freed; instead they are kept on the
> &q->unused_hctx_list list.
>
> Though this optimization seems quite intuitive, the performance test
> shows that it neither benefits nor harms performance, when 3
> threads concurrently randread (bs=4k, direct=1) one dm-linear
> device, which is built upon 3 nvme devices, with one polling hw queue
> per nvme device.
>
> I'm not sure why it doesn't help; maybe the number of devices,
> or the depth of the device stack, is too low in my test case?
Looks like these patch references are stale (they were relative to v3 I
think).. e.g.: "patch 11" from v3 is really "patch 10" in v5.. but its
implementation has changed because Mikulas pointed out that it
was unsafe.. IIRC?
Anyway, I'll just focus on reviewing each patch in this v5 now.
Thanks,
Mike
* Re: [dm-devel] [PATCH v3 00/11] dm: support IO polling
2021-03-10 20:01 ` Mike Snitzer
@ 2021-03-11 7:07 ` JeffleXu
0 siblings, 0 replies; 27+ messages in thread
From: JeffleXu @ 2021-03-11 7:07 UTC (permalink / raw)
To: Mike Snitzer
Cc: axboe, ejt, caspar, io-uring, linux-block, joseph.qi, dm-devel,
hch
On 3/11/21 4:01 AM, Mike Snitzer wrote:
> On Mon, Feb 08 2021 at 3:52am -0500,
> Jeffle Xu <[email protected]> wrote:
>
>>
>> [Performance]
>> 1. One thread (numjobs=1) randread (bs=4k, direct=1) one dm-linear
>> device, which is built upon 3 nvme devices, with one polling hw
>> queue per nvme device.
>>
>> | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
>> ---- | --------------- | -------------------- | ----
>> dm | 208k | 279k | ~34%
>>
>>
>> 2. Three threads (numjobs=3) randread (bs=4k, direct=1) one dm-linear
>> device, which is built upon 3 nvme devices, with one polling hw
>> queue per nvme device.
>>
>> It's compared to 3 threads directly randread 3 nvme devices, with one
>> polling hw queue per nvme device. No CPU affinity set for these 3
>> threads. Thus every thread can access every nvme device
>> (filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1), i.e., every thread
>> needs to compete for every polling hw queue.
>>
>> | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
>> ---- | --------------- | -------------------- | ----
>> dm | 615k | 728k | ~18%
>> nvme | 728k | 873k | ~20%
>>
>> The result shows that the performance gain of bio-based polling is
>> comparable to that of mq polling in the same test case.
>>
>>
>> 3. Three threads (numjobs=3) randread (bs=12k, direct=1) one
>> **dm-stripe** device, which is built upon 3 nvme devices, with one
>> polling hw queue per nvme device.
>>
>> It's compared to 3 threads directly randread 3 nvme devices, with one
>> polling hw queue per nvme device. No CPU affinity set for these 3
>> threads. Thus every thread can access every nvme device
>> (filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1), i.e., every thread
>> needs to compete for every polling hw queue.
>>
>> | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
>> ---- | --------------- | -------------------- | ----
>> dm | 314k | 354k | ~13%
>> nvme | 728k | 873k | ~20%
>>
>
> So this "3." case is meant to illustrate effects of polling when bio is
> split to 3 different underlying blk-mq devices. (think it odd that
> dm-stripe across 3 nvme devices is performing so much worse than a
> single nvme device)
>
> It would be nice to see a comparison of this same workload but _with_ CPU
> affinity, to show the relative benefit of patch 11 in this series (where
> you try to leverage CPU pinning).
Actually I have done that test. You can see that without the optimization
in patch 11, i.e., leveraging CPU pinning, there's only a ~13%
performance gain from polling.
As I stated in [1] (see item "2. sub-fastpath"), with the
optimization in patch 11 the performance gain is ~32%. The optimization
behaves well: it mainly reduces the contention where multiple CPUs compete
for the same hw queue.
[1]
https://patchwork.kernel.org/project/dm-devel/cover/[email protected]/
>
> In general, I don't think patch 11 is a worthwhile effort. It is a
> half-measure that tries to paper over the fact that this bio-based
> IO polling patchset is quite coarse-grained about how it copes with the
> need to poll multiple devices.
Indeed. It is somewhat of a compromise. If we had a better scheme for
improving multi-thread performance, this one should certainly go... I
explain the difficulty of implementing the optimization in other ways in
the following comment.
>
> Patch 10 is a proper special case that should be accounted for (when a
> bio isn't split and only gets submitted to a single blk-mq
> device/hctx). But even patch 10's approach is fragile (as we've
> discussed in private and as I'll touch on in my reply to patch 10).
>
> But I think patch 11 should be dropped, and we should defer optimizing the
> bio-splitting case to follow-on work.
>
> Just so you're aware, I'm now thinking the proper way forward is to
> update DM core, at the time of DM table load, to assemble an array of
> underlying _data_ devices in that table (as iterated with
> .iterate_devices) -- this would allow each underlying data device to be
> assigned an immutable index for the lifetime of a DM table. It'd be
> hooked off the 'struct dm_table' and would share that object's
> lifetime.
I have also considered that. There are two ways of organizing the table,
though each has its own defect.
1. Each dm table maintains the underlying devices at its own level, e.g., if
dm-0 is built upon dm-1, then the dm table of dm-0 maintains all
underlying devices of dm-0, while the dm table of dm-1 maintains its
own underlying devices.
In this case, there are multiple arrays, one per level. If the returned
cookie is still a single integer, it may be difficult to design the
cookie so that it identifies one slot among all these arrays (a rough
bit-packing illustration follows below, after item 2). If the returned
cookie is actually a group of indexes, with each indexing into one dm
table, then where and how to store these indexes...
2. The dm device at the topmost level maintains one large array,
aggregating all underlying devices at all levels.
In this case, a single array is maintained and one integer cookie
is enough. However, the defect is that when an underlying dm
device reloads its table, it is unclear how and when to update the array at
the topmost level.
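Just to illustrate why a single integer cookie gets cramped in case 1 (the
bit split and macro names below are purely made up, not a proposal):

/*
 * blk_qc_t is a plain 32-bit integer, so a stacked encoding has to split
 * its bits between "which level of the stack" and "which slot in that
 * level's per-table array" -- e.g. reserving 12 bits for the per-table
 * index leaves the remaining bits to say which level (and which table at
 * that level) is meant, and it is exactly that "which table" part that is
 * hard to encode once several dm tables sit side by side.
 */
#define DM_QC_INDEX_BITS	12

static inline blk_qc_t dm_make_cookie(unsigned int level, unsigned int index)
{
	return (level << DM_QC_INDEX_BITS) | index;
}

static inline unsigned int dm_cookie_level(blk_qc_t qc)
{
	return qc >> DM_QC_INDEX_BITS;
}

static inline unsigned int dm_cookie_index(blk_qc_t qc)
{
	return qc & ((1U << DM_QC_INDEX_BITS) - 1);
}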
>
> With that bit of infrastructure in place, we could then take steps to
> make DM's cookie more dense in its description of underlying devices
> that need polling. This is where I'll get a bit handwavy.. but I
> raised this idea with Joe Thornber and he is going to have a think about
> it too.
>
> But this is all to say: optimizing the complex case of bio-splitting
> that is such an integral part of bio-based IO processing needs more than
> what you've attempted to do (noble effort on your part but again, really
> just a half-measure).
>
> So I think it best to keep the initial implementation of bio-based
> polling relatively simple by laying the foundation for follow-on work. And
> it is also important to _not_ encode in block core some meaning to what
> _should_ be a largely opaque cookie (blk_qc_t) that is for the
> underlying driver to make sense of.
>
Agreed. The difficulty of the design pattern is explained in the
corresponding reply for v5.
>
>>
>> 4. This patchset shall do no harm to the performance of the original mq
>> polling. Following is the test results of one thread (numjobs=1)
>> randread (bs=4k, direct=1) one nvme device.
>>
>> | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
>> ---------------- | --------------- | -------------------- | ----
>> without patchset | 242k | 332k | ~39%
>> with patchset | 236k | 332k | ~39%
>
> OK, good, this needs to be the case.
>
>>
>> [Changes since v2]
>>
>> Patchset v2 caches all hw queues (in polling mode) of underlying mq
>> devices in dm layer. The polling routine actually iterates through all
>> these cached hw queues.
>>
>> However, mq may change the queue mapping at runtime (e.g., NVMe RESET
>> command), thus the cached hw queues in dm layer may be out-of-date. Thus
>> patchset v3 falls back to the implementation of the very first RFC
>> version, in which the mq layer needs to export one interface iterating
>> all polling hw queues (patch 5), and the bio-based polling routine just
>> calls this interface to iterate all polling hw queues.
>>
>> Besides, several new optimizations are proposed.
>>
>>
>> - patch 1,2,7
>> same as v2, untouched
>>
>> - patch 3
>> Following advice from Christoph Hellwig, the blk_poll() refactoring
>> splits the mq and bio-based polling routines from the very beginning. Now
>> blk_poll() is just a simple entry point. blk_bio_poll() is simply copied from
>> blk_mq_poll(), though the loop structure is some sort of duplication.
>>
>> - patch 4
>> This patch is newly added to support turning polling on/off through
>> '/sys/block/<dev>/queue/io_poll' dynamically for bio-based devices.
>> Patchset v2 implemented this functionality by adding one new queue flag,
>> which is not preferred since queue flags are in quite short supply
>> nowadays.
>>
>> - patch 5
>> This patch is newly added, preparing for the following bio-based
>> polling. The following bio-based polling will call this helper function,
>> accounting on the corresponding hw queue.
>>
>> - patch 6
>> It's from the very first RFC version, preparing for the following
>> bio-based polling.
>>
>> - patch 8
>> One fixing patch needed by the following bio-based polling. It's
>> actually a v2 of [1]. I had sent the v2 singly in-reply-to [1], though
>> it has not been visible on the mailing list maybe due to the delay.
>>
>> - patch 9
>> It's from the very first RFC version.
>>
>> - patch 10
>> This patch is newly added. Patchset v2 had proposed one
>> optimization: skipping the **busy** hw queues during the iteration
>> phase. Back then, one 'atomic_t' flag was specifically
>> maintained in the dm layer, representing whether the corresponding hw queue
>> is busy or not. The idea is inherited, while the implementation changes.
>> Now @nvmeq->cq_poll_lock is used directly here; no extra flag is needed
>> anymore.
>>
>> This optimization can significantly reduce the competition for one hw
>> queue between multiple polling instances. The following statistics are the
>> test results when 3 threads concurrently randread (bs=4k, direct=1) one
>> dm-linear device, which is built upon 3 nvme devices, with one polling
>> hw queue per nvme device.
>>
>> | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
>> ----------- | --------------- | -------------------- | ----
>> without opt | 318k | 256k | ~-20%
>> with opt | 314k | 354k | ~13%
>>
>>
>> - patch 11
>> This is another newly added optimization for bio-based polling.
>>
>> One intuitive insight is that, when the original bio submitted to the dm
>> device doesn't get split, the bio gets enqueued into only one hw
>> queue of one of the underlying mq devices. In this case, we no longer
>> need to track all split bios, and one cookie (for the only split bio)
>> is enough. This is implemented by returning the pointer to the
>> corresponding hw queue in this case.
>>
>> It should be safe to directly return the pointer to the hw queue,
>> since a 'struct blk_mq_hw_ctx' won't be freed during the whole lifetime of
>> the 'struct request_queue'. Even though the number of hw queues may decrease
>> when an NVMe RESET happens, the 'struct blk_mq_hw_ctx' structures of the
>> removed hw queues won't be freed; instead they are kept on the
>> &q->unused_hctx_list list.
>>
>> Though this optimization seems quite intuitive, the performance test
>> shows that it neither benefits nor harms performance, when 3
>> threads concurrently randread (bs=4k, direct=1) one dm-linear
>> device, which is built upon 3 nvme devices, with one polling hw queue
>> per nvme device.
>>
>> I'm not sure why it doesn't help; maybe the number of devices,
>> or the depth of the device stack, is too low in my test case?
>
> Looks like these patch references are stale (they were relative to v3 I
> think).. e.g.: "patch 11" from v3 is really "patch 10" in v5.. but its
> implementation has changed because Mikulas pointed out that it
> was unsafe.. IIRC?
>
> Anyway, I'll just focus on reviewing each patch in this v5 now.
>
> Thanks,
> Mike
>
>
--
Thanks,
Jeffle
Thread overview: 27+ messages
2021-02-08 8:52 [PATCH v3 00/11] dm: support IO polling Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 01/11] block: move definition of blk_qc_t to types.h Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 02/11] block: add queue_to_disk() to get gendisk from request_queue Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 03/11] block: add poll method to support bio-based IO polling Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 04/11] block: add poll_capable " Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 05/11] block/mq: extract one helper function polling hw queue Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 06/11] block/mq: add iterator for polling hw queues Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 07/11] dm: always return BLK_QC_T_NONE for bio-based device Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 08/11] dm: fix iterate_device sanity check Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 09/11] dm: support IO polling for bio-based dm device Jeffle Xu
2021-02-09 3:11 ` Ming Lei
2021-02-09 6:13 ` JeffleXu
2021-02-09 6:22 ` JeffleXu
2021-02-09 8:07 ` Ming Lei
2021-02-09 8:46 ` JeffleXu
2021-02-19 14:17 ` [dm-devel] " Mikulas Patocka
2021-02-24 1:42 ` JeffleXu
2021-02-26 8:22 ` JeffleXu
2021-02-08 8:52 ` [PATCH v3 10/11] nvme/pci: don't wait for locked polling queue Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 11/11] dm: fastpath of bio-based polling Jeffle Xu
2021-02-19 19:38 ` [dm-devel] " Mikulas Patocka
2021-02-26 8:12 ` JeffleXu
2021-03-02 19:03 ` Mikulas Patocka
2021-03-03 1:55 ` JeffleXu
2021-02-17 13:15 ` [PATCH v3 00/11] dm: support IO polling JeffleXu
2021-03-10 20:01 ` Mike Snitzer
2021-03-11 7:07 ` [dm-devel] " JeffleXu