* [PATCH 1/3] block: add completion handler for fast path
From: Jens Axboe @ 2021-12-15 16:30 UTC
To: io-uring, linux-block; +Cc: Jens Axboe
The batched completion path only deals with non-partial requests anyway,
and it doesn't handle any requests that have errors. Add a completion
handler that assumes it's a full request and that it's all being ended
successfully.
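
As a caller-side illustration (a hypothetical sketch, not part of this
patch; end_request_example() is a made-up name, blk_complete_request()
is the new helper below and blk_update_request() is the existing generic
path), the intended split is simply:

	/*
	 * Hypothetical sketch: take the fast path only for a fully
	 * successful completion; anything else keeps going through the
	 * generic blk_update_request() bookkeeping, as the batch path
	 * did before this patch.
	 */
	static void end_request_example(struct request *rq, blk_status_t status)
	{
		if (status == BLK_STS_OK)
			blk_complete_request(rq);
		else
			blk_update_request(rq, status, blk_rq_bytes(rq));
	}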
Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-mq.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 42 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f24394cb2004..efa9b93b4ddb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -709,6 +709,47 @@ static void blk_print_req_error(struct request *req, blk_status_t status)
 		IOPRIO_PRIO_CLASS(req->ioprio));
 }
+/*
+ * Fully end IO on a request. Does not support partial completions, or
+ * errors.
+ */
+static void blk_complete_request(struct request *req)
+{
+	const bool is_flush = (req->rq_flags & RQF_FLUSH_SEQ) != 0;
+	int total_bytes = blk_rq_bytes(req);
+	struct bio *bio = req->bio;
+
+	trace_block_rq_complete(req, BLK_STS_OK, total_bytes);
+
+	if (!bio)
+		return;
+
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+	if (blk_integrity_rq(req) && req_op(req) == REQ_OP_READ)
+		req->q->integrity.profile->complete_fn(req, total_bytes);
+#endif
+
+	blk_account_io_completion(req, total_bytes);
+
+	do {
+		struct bio *next = bio->bi_next;
+
+		/* Completion has already been traced */
+		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
+		if (!is_flush)
+			bio_endio(bio);
+		bio = next;
+	} while (bio);
+
+	/*
+	 * Reset counters so that the request stacking driver
+	 * can find how many bytes remain in the request
+	 * later.
+	 */
+	req->bio = NULL;
+	req->__data_len = 0;
+}
+
 /**
  * blk_update_request - Complete multiple bytes without completing the request
  * @req:	the request being processed
@@ -922,7 +963,7 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
 		prefetch(rq->bio);
 		prefetch(rq->rq_next);
-		blk_update_request(rq, BLK_STS_OK, blk_rq_bytes(rq));
+		blk_complete_request(rq);
 		if (iob->need_ts)
 			__blk_mq_end_request_acct(rq, now);
--
2.34.1
* [PATCH 2/3] block: use singly linked list for bio cache
From: Jens Axboe @ 2021-12-15 16:30 UTC
To: io-uring, linux-block; +Cc: Jens Axboe
It's pointless to maintain a head/tail for the list, as we never need to
access the tail. Entries are always pushed and popped LIFO for cache
hotness reasons.
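
In isolation, the push/pop pattern the cache switches to is just the
classic singly linked LIFO (a minimal sketch of the idea, not the diff
itself):

	/* push: the freed bio becomes the new head */
	bio->bi_next = cache->free_list;
	cache->free_list = bio;

	/* pop: take the head and advance to the next entry */
	bio = cache->free_list;
	if (bio)
		cache->free_list = bio->bi_next;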
Signed-off-by: Jens Axboe <[email protected]>
---
block/bio.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index d9d8e1143edc..a76a3134625a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -26,7 +26,7 @@
 #include "blk-rq-qos.h"
 struct bio_alloc_cache {
-	struct bio_list free_list;
+	struct bio *free_list;
 	unsigned int nr;
 };
@@ -630,7 +630,9 @@ static void bio_alloc_cache_prune(struct bio_alloc_cache *cache,
 	unsigned int i = 0;
 	struct bio *bio;
-	while ((bio = bio_list_pop(&cache->free_list)) != NULL) {
+	while (cache->free_list) {
+		bio = cache->free_list;
+		cache->free_list = bio->bi_next;
 		cache->nr--;
 		bio_free(bio);
 		if (++i == nr)
@@ -689,7 +691,8 @@ void bio_put(struct bio *bio)
 		bio_uninit(bio);
 		cache = per_cpu_ptr(bio->bi_pool->cache, get_cpu());
-		bio_list_add_head(&cache->free_list, bio);
+		bio->bi_next = cache->free_list;
+		cache->free_list = bio;
 		if (++cache->nr > ALLOC_CACHE_MAX + ALLOC_CACHE_SLACK)
 			bio_alloc_cache_prune(cache, ALLOC_CACHE_SLACK);
 		put_cpu();
@@ -1700,8 +1703,9 @@ struct bio *bio_alloc_kiocb(struct kiocb *kiocb, unsigned short nr_vecs,
 		return bio_alloc_bioset(GFP_KERNEL, nr_vecs, bs);
 	cache = per_cpu_ptr(bs->cache, get_cpu());
-	bio = bio_list_pop(&cache->free_list);
-	if (bio) {
+	if (cache->free_list) {
+		bio = cache->free_list;
+		cache->free_list = bio->bi_next;
 		cache->nr--;
 		put_cpu();
 		bio_init(bio, nr_vecs ? bio->bi_inline_vecs : NULL, nr_vecs);
--
2.34.1
* Re: [PATCH 2/3] block: use singly linked list for bio cache
From: Christoph Hellwig @ 2021-12-16 9:11 UTC
To: Jens Axboe; +Cc: io-uring, linux-block
On Wed, Dec 15, 2021 at 09:30:08AM -0700, Jens Axboe wrote:
> Pointless to maintain a head/tail for the list, as we never need to
> access the tail. Entries are always LIFO for cache hotness reasons.
>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> block/bio.c | 14 +++++++++-----
> 1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index d9d8e1143edc..a76a3134625a 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -26,7 +26,7 @@
>  #include "blk-rq-qos.h"
>
>  struct bio_alloc_cache {
> -	struct bio_list free_list;
> +	struct bio *free_list;
>  	unsigned int nr;
>  };
>
> @@ -630,7 +630,9 @@ static void bio_alloc_cache_prune(struct bio_alloc_cache *cache,
>  	unsigned int i = 0;
>  	struct bio *bio;
>
> -	while ((bio = bio_list_pop(&cache->free_list)) != NULL) {
> +	while (cache->free_list) {
> +		bio = cache->free_list;
Nit:

	while ((bio = cache->free_list) != NULL) {

would make the iteration a little more obvious.
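
Applied to the hunk above, that would make the loop read (just the
suggestion spelled out, same behaviour):

	while ((bio = cache->free_list) != NULL) {
		cache->free_list = bio->bi_next;
		cache->nr--;
		bio_free(bio);
		if (++i == nr)
			break;
	}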
Otherwise looks good:
Reviewed-by: Christoph Hellwig <[email protected]>
* [PATCH 3/3] block: enable bio allocation cache for IRQ driven IO
From: Jens Axboe @ 2021-12-15 16:30 UTC
To: io-uring, linux-block; +Cc: Jens Axboe
We currently cannot use the bio recycling allocation cache for IRQ driven
IO, as the cache isn't IRQ safe (by design).
Add a way for the completion side to pass back a bio that needs freeing,
so we can do it from the io_uring side. io_uring completions always
run in task context.
This is good for about a 13% improvement in IRQ driven IO, taking us from
around 6.3M/core to 7.1M/core IOPS.
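
In outline, the handshake between the two sides looks like this (a
condensed sketch of the hunks below, not additional code):

	/* block completion (IRQ) side: don't free here, hand the bio back */
	if (iocb->ki_flags & IOCB_BIO_PASSBACK) {
		iocb->ki_flags |= IOCB_PRIV_IS_BIO;
		iocb->private = bio;
	} else {
		bio_put(bio);
	}

	/* io_uring side, task context: now it's safe to put/recycle it */
	if (req->rw.kiocb.ki_flags & IOCB_PRIV_IS_BIO)
		bio_put(req->rw.kiocb.private);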
Signed-off-by: Jens Axboe <[email protected]>
---
block/fops.c | 11 ++++++++---
fs/io_uring.c | 6 ++++++
include/linux/fs.h | 4 ++++
3 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/block/fops.c b/block/fops.c
index bcf866b07edc..c7794d42be85 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -296,14 +296,19 @@ static void blkdev_bio_end_io_async(struct bio *bio)
 		ret = blk_status_to_errno(bio->bi_status);
 	}
-	iocb->ki_complete(iocb, ret);
-
 	if (dio->flags & DIO_SHOULD_DIRTY) {
 		bio_check_pages_dirty(bio);
 	} else {
 		bio_release_pages(bio, false);
-		bio_put(bio);
+		if (iocb->ki_flags & IOCB_BIO_PASSBACK) {
+			iocb->ki_flags |= IOCB_PRIV_IS_BIO;
+			iocb->private = bio;
+		} else {
+			bio_put(bio);
+		}
 	}
+
+	iocb->ki_complete(iocb, ret);
 }
 static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1a647a6a5add..b0302c0407e6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2828,6 +2828,11 @@ static void io_req_task_complete(struct io_kiocb *req, bool *locked)
 	unsigned int cflags = io_put_kbuf(req);
 	int res = req->result;
+#ifdef CONFIG_BLOCK
+	if (req->rw.kiocb.ki_flags & IOCB_PRIV_IS_BIO)
+		bio_put(req->rw.kiocb.private);
+#endif
+
 	if (*locked) {
 		io_req_complete_state(req, res, cflags);
 		io_req_add_compl_list(req);
@@ -3024,6 +3029,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	} else {
 		if (kiocb->ki_flags & IOCB_HIPRI)
 			return -EINVAL;
+		kiocb->ki_flags |= IOCB_ALLOC_CACHE | IOCB_BIO_PASSBACK;
 		kiocb->ki_complete = io_complete_rw;
 	}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6b8dc1a78df6..bf7a76dfdc29 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -322,6 +322,10 @@ enum rw_hint {
 #define IOCB_NOIO		(1 << 20)
 /* can use bio alloc cache */
 #define IOCB_ALLOC_CACHE	(1 << 21)
+/* iocb supports bio passback */
+#define IOCB_BIO_PASSBACK	(1 << 22)
+/* iocb->private holds bio to put */
+#define IOCB_PRIV_IS_BIO	(1 << 23)
 struct kiocb {
 	struct file		*ki_filp;
--
2.34.1
* Re: [PATCH 3/3] block: enable bio allocation cache for IRQ driven IO
From: Christoph Hellwig @ 2021-12-16 14:33 UTC
To: Jens Axboe; +Cc: io-uring, linux-block
On Wed, Dec 15, 2021 at 09:30:09AM -0700, Jens Axboe wrote:
> We currently cannot use the bio recycling allocation cache for IRQ driven
> IO, as the cache isn't IRQ safe (by design).
>
> Add a way for the completion side to pass back a bio that needs freeing,
> so we can do it from the io_uring side. io_uring completions always
> run in task context.
>
> This is good for about a 13% improvement in IRQ driven IO, taking us from
> around 6.3M/core to 7.1M/core IOPS.
The numbers look great, but I really hate how it ties the caller into
using a bio. I'll have to think hard about a better structure.
Just curious: are the numbers with retpolines or without? Do you care
about the cost of indirect calls with retpolines for these benchmarks?
* Re: [PATCH 3/3] block: enable bio allocation cache for IRQ driven IO
From: Jens Axboe @ 2021-12-16 15:41 UTC
To: Christoph Hellwig; +Cc: io-uring, linux-block
On 12/16/21 7:33 AM, Christoph Hellwig wrote:
> On Wed, Dec 15, 2021 at 09:30:09AM -0700, Jens Axboe wrote:
>> We currently cannot use the bio recycling allocation cache for IRQ driven
>> IO, as the cache isn't IRQ safe (by design).
>>
>> Add a way for the completion side to pass back a bio that needs freeing,
>> so we can do it from the io_uring side. io_uring completions always
>> run in task context.
>>
>> This is good for about a 13% improvement in IRQ driven IO, taking us from
>> around 6.3M/core to 7.1M/core IOPS.
>
> The numbers look great, but I really hate how it ties the caller into
> using a bio. I'll have to think hard about a better structure.
Yeah, it's definitely not the prettiest, but I haven't been able to come
up with a better approach so far. I don't think the bio association is
the worst; I can't imagine we'd want to tie it to something else. I might
be wrong...
Ideally we'd flip the completion model upside down here, instead of
having ->bi_end_io() call ->ki_complete.
> Just curious: are the numbers with retpolines or without? Do you care
> about the cost of indirect calls with retpolines for these benchmarks?
No mitigations enabled for these tests, so no retpoline. I definitely do
care about indirect call performance. The peak testing so far has been
mostly focused on the best case - common but costly features disabled,
and no mitigations enabled. That's mostly because it's necessary to break
the issues down one by one, and the intent has always been to move
towards making common options less expensive too. Most of the patches
cover both bases, but there are also cases where we want to fix just the
items that are costly when features/mitigations are enabled.
--
Jens Axboe