From: JeffleXu <[email protected]>
To: [email protected], [email protected]
Cc: [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected]
Subject: Re: [PATCH v3 00/11] dm: support IO polling
Date: Wed, 17 Feb 2021 21:15:32 +0800 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
Hi,
Any comment on this series?
Thanks,
Jeffle
On 2/8/21 4:52 PM, Jeffle Xu wrote:
>
> [Performance]
> 1. One thread (numjobs=1) randread (bs=4k, direct=1) one dm-linear
> device, which is built upon 3 nvme devices, with one polling hw
> queue per nvme device.
>
> | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
> ---- | --------------- | -------------------- | ----
> dm | 208k | 279k | ~34%
>
>
> 2. Three threads (numjobs=3) randread (bs=4k, direct=1) one dm-linear
> device, which is built upon 3 nvme devices, with one polling hw
> queue per nvme device.
>
> It's compared to 3 threads directly randread 3 nvme devices, with one
> polling hw queue per nvme device. No CPU affinity set for these 3
> threads. Thus every thread can access every nvme device
> (filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1), i.e., every thread
> need to competing for every polling hw queue.
>
> | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
> ---- | --------------- | -------------------- | ----
> dm | 615k | 728k | ~18%
> nvme | 728k | 873k | ~20%
>
> The result shows that the performance gain of bio-based polling is
> comparable as that of mq polling in the same test case.
>
>
> 3. Three threads (numjobs=3) randread (bs=12k, direct=1) one
> **dm-stripe** device, which is built upon 3 nvme devices, with one
> polling hw queue per nvme device.
>
> It's compared to 3 threads directly randread 3 nvme devices, with one
> polling hw queue per nvme device. No CPU affinity set for these 3
> threads. Thus every thread can access every nvme device
> (filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1), i.e., every thread
> need to competing for every polling hw queue.
>
> | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
> ---- | --------------- | -------------------- | ----
> dm | 314k | 354k | ~13%
> nvme | 728k | 873k | ~20%
>
>
> 4. This patchset shall do no harm to the performance of the original mq
> polling. Following is the test results of one thread (numjobs=1)
> randread (bs=4k, direct=1) one nvme device.
>
> | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
> ---------------- | --------------- | -------------------- | ----
> without patchset | 242k | 332k | ~39%
> with patchset | 236k | 332k | ~39%
>
>
>
> [Changes since v2]
>
> Patchset v2 caches all hw queues (in polling mode) of underlying mq
> devices in dm layer. The polling routine actually iterates through all
> these cached hw queues.
>
> However, mq may change the queue mapping at runtime (e.g., NVMe RESET
> command), thus the cached hw queues in dm layer may be out-of-date. Thus
> patchset v3 falls back to the implementation of the very first RFC
> version, in which the mq layer needs to export one interface iterating
> all polling hw queues (patch 5), and the bio-based polling routine just
> calls this interface to iterate all polling hw queues.
>
> Besides, several new optimization is proposed.
>
>
> - patch 1,2,7
> same as v2, untouched
>
> - patch 3
> Considering advice from Christoph Hellwig, while refactoring blk_poll(),
> split mq and bio-based polling routine from the very beginning. Now
> blk_poll() is just a simple entry. blk_bio_poll() is simply copied from
> blk_mq_poll(), while the loop structure is some sort of duplication
> though.
>
> - patch 4
> This patch is newly added to support turning on/off polling through
> '/sys/block/<dev>/queue/io_poll' dynamiclly for bio-based devices.
> Patchset v2 implemented this functionality by added one new queue flag,
> which is not preferred since the queue flag resource is quite short of
> nowadays.
>
> - patch 5
> This patch is newly added, preparing for the following bio-based
> polling. The following bio-based polling will call this helper function,
> accounting on the corresponding hw queue.
>
> - patch 6
> It's from the very first RFC version, preparing for the following
> bio-based polling.
>
> - patch 8
> One fixing patch needed by the following bio-based polling. It's
> actually a v2 of [1]. I had sent the v2 singly in-reply-to [1], though
> it has not been visible on the mailing list maybe due to the delay.
>
> - patch 9
> It's from the very first RFC version.
>
> - patch 10
> This patch is newly added. Patchset v2 had ever proposed one
> optimization that, skipping the **busy** hw queues during the iteration
> phase. Back upon that time, one flag of 'atomic_t' is specifically
> maintained in dm layer, representing if the corresponding hw queue is
> busy or not. The idea is inherited, while the implementation changes.
> Now @nvmeq->cq_poll_lock is used directly here, no need for extra flag
> anymore.
>
> This optimization can significantly reduce the competition for one hw
> queue between multiple polling instances. Following statistics is the
> test result when 3 threads concurrently randread (bs=4k, direct=1) one
> dm-linear device, which is built upon 3 nvme devices, with one polling
> hw queue per nvme device.
>
> | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
> ----------- | --------------- | -------------------- | ----
> without opt | 318k | 256k | ~-20%
> with opt | 314k | 354k | ~13%
>
>
> - patch 11
> This is another newly added optimizatin for bio-based polling.
>
> One intuitive insight is that, when the original bio submitted to dm
> device doesn't get split, then the bio gets enqueued into only one hw
> queue of one of the underlying mq devices. In this case, we no longer
> need to track all split bios, and one cookie (for the only split bio)
> is enough. It is implemented by returning the pointer to the
> corresponding hw queue in this case.
>
> It should be safe by directly returning the pointer to the hw queue,
> since 'struct blk_mq_hw_ctx' won't be freed during the whole lifetime of
> 'struct request_queue'. Even when the number of hw queues may decrease
> when NVMe RESET happens, the 'struct request_queue' structure of decreased
> hw queues won't be freed, instead it's buffered into
> &q->unused_hctx_list list.
>
> Though this optimization seems quite intuitive, the performance test
> shows that it does no benefit nor harm to the performance, while 3
> threads concurrently randreading (bs=4k, direct=1) one dm-linear
> device, which is built upon 3 nvme devices, with one polling hw queue
> per nvme device.
>
> I'm not sure why it doesn't work, maybe because the number of devices,
> or the depth of the devcice stack is to low in my test case?
>
>
> [Remained Issue]
> It has been mentioned in patch 4 that, users could change the state of
> the underlying devices through '/sys/block/<dev>/io_poll', bypassing
> the dm device above. Thus it can cause a situation where QUEUE_FLAG_POLL
> is still set for the request_queue of dm device, while one of the
> underlying mq device may has cleared this flag.
>
> In this case, it will pass the 'test_bit(QUEUE_FLAG_POLL, &q->queue_flags)'
> check in blk_poll(), while the input cookie may actually points to a hw
> queue in IRQ mode since patch 11. Thus for this hw queue (in IRQ mode),
> the bio-based polling routine will handle this hw queue acquiring
> 'spin_lock(&nvmeq->cq_poll_lock)' (refer
> drivers/nvme/host/pci.c:nvme_poll), which is not adequate since this hw
> queue may also be accessed in IRQ context. In other words,
> spin_lock_irq() should be used here.
>
> I have not come up one simple way to fix it. I don't want to do sanity
> check (e.g., the type of the hw queue is HCTX_TYPE_POLL or not) in the
> IO path (submit_bio()/blk_poll()), i.e., fast path.
>
> We'd better fix it in the control path, i.e., dm could be aware of the
> change when attribute (e.g., support io_poll or not) of one of the
> underlying devices changed at runtime.
>
> [1] https://listman.redhat.com/archives/dm-devel/2021-February/msg00028.html
>
>
> changes since v1:
> - patch 1,2,4 is the same as v1 and have already been reviewed
> - patch 3 is refactored a bit on the basis of suggestions from
> Mike Snitzer.
> - patch 5 is newly added and introduces one new queue flag
> representing if the queue is capable of IO polling. This mainly
> simplifies the logic in queue_poll_store().
> - patch 6 implements the core mechanism supporting IO polling.
> The sanity check checking if the dm device supports IO polling is
> also folded into this patch, and the queue flag will be cleared if
> it doesn't support, in case of table reloading.
>
>
>
>
> Jeffle Xu (11):
> block: move definition of blk_qc_t to types.h
> block: add queue_to_disk() to get gendisk from request_queue
> block: add poll method to support bio-based IO polling
> block: add poll_capable method to support bio-based IO polling
> block/mq: extract one helper function polling hw queue
> block/mq: add iterator for polling hw queues
> dm: always return BLK_QC_T_NONE for bio-based device
> dm: fix iterate_device sanity check
> dm: support IO polling for bio-based dm device
> nvme/pci: don't wait for locked polling queue
> dm: fastpath of bio-based polling
>
> block/blk-core.c | 89 +++++++++++++-
> block/blk-mq.c | 27 +----
> block/blk-sysfs.c | 14 ++-
> drivers/md/dm-table.c | 215 ++++++++++++++++++----------------
> drivers/md/dm.c | 90 +++++++++++---
> drivers/md/dm.h | 2 +-
> drivers/nvme/host/pci.c | 4 +-
> include/linux/blk-mq.h | 21 ++++
> include/linux/blk_types.h | 10 +-
> include/linux/blkdev.h | 4 +
> include/linux/device-mapper.h | 1 +
> include/linux/fs.h | 2 +-
> include/linux/types.h | 3 +
> include/trace/events/kyber.h | 6 +-
> 14 files changed, 333 insertions(+), 155 deletions(-)
>
--
Thanks,
Jeffle
next prev parent reply other threads:[~2021-02-17 13:16 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-02-08 8:52 [PATCH v3 00/11] dm: support IO polling Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 01/11] block: move definition of blk_qc_t to types.h Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 02/11] block: add queue_to_disk() to get gendisk from request_queue Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 03/11] block: add poll method to support bio-based IO polling Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 04/11] block: add poll_capable " Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 05/11] block/mq: extract one helper function polling hw queue Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 06/11] block/mq: add iterator for polling hw queues Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 07/11] dm: always return BLK_QC_T_NONE for bio-based device Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 08/11] dm: fix iterate_device sanity check Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 09/11] dm: support IO polling for bio-based dm device Jeffle Xu
2021-02-09 3:11 ` Ming Lei
2021-02-09 6:13 ` JeffleXu
2021-02-09 6:22 ` JeffleXu
2021-02-09 8:07 ` Ming Lei
2021-02-09 8:46 ` JeffleXu
2021-02-19 14:17 ` [dm-devel] " Mikulas Patocka
2021-02-24 1:42 ` JeffleXu
2021-02-26 8:22 ` JeffleXu
2021-02-08 8:52 ` [PATCH v3 10/11] nvme/pci: don't wait for locked polling queue Jeffle Xu
2021-02-08 8:52 ` [PATCH v3 11/11] dm: fastpath of bio-based polling Jeffle Xu
2021-02-19 19:38 ` [dm-devel] " Mikulas Patocka
2021-02-26 8:12 ` JeffleXu
2021-03-02 19:03 ` Mikulas Patocka
2021-03-03 1:55 ` JeffleXu
2021-02-17 13:15 ` JeffleXu [this message]
2021-03-10 20:01 ` [PATCH v3 00/11] dm: support IO polling Mike Snitzer
2021-03-11 7:07 ` [dm-devel] " JeffleXu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=7d440a6f-e533-f34c-90a2-4afd3c3d9b1c@linux.alibaba.com \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox