From: Jens Axboe <[email protected]>
To: Xiaoguang Wang <[email protected]>,
[email protected]
Cc: Pavel Begunkov <[email protected]>,
joseph qi <[email protected]>,
Jiufei Xue <[email protected]>
Subject: Re: Is io_uring framework becoming bloated gradually? and introduces performace regression
Date: Mon, 11 May 2020 12:08:48 -0600 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On 5/10/20 9:30 PM, Xiaoguang Wang wrote:
> hi,
>
>> On 5/8/20 9:18 AM, Xiaoguang Wang wrote:
>>> hi,
>>>
>>> This issue was found when I tested IORING_FEAT_FAST_POLL feature, with
>>> the newest upstream codes, indeed I find that io_uring's performace
>>> improvement is not obvious compared to epoll in my test environment,
>>> most of the time they are similar. Test cases basically comes from:
>>> https://github.com/frevib/io_uring-echo-server/blob/io-uring-feat-fast-poll/benchmarks/benchmarks.md.
>>> In above url, the author's test results shows that io_uring will get a
>>> big performace improvement compared to epoll. I'm still looking into
>>> why I don't get the big improvement, currently don't know why, but I
>>> find some obvious regression issue.
>>>
>>> I wrote a simple tool based io_uring nop operation to evaluate
>>> io_uring framework in v5.1 and 5.7.0-rc4+(jens's io_uring-5.7 branch),
>>> I see a obvious performace regression:
>>>
>>> v5.1 kernel:
>>> $sudo taskset -c 60 ./io_uring_nop_stress -r 300 # run 300 seconds
>>> total ios: 1832524960
>>> IOPS: 6108416
>>> 5.7.0-rc4+
>>> $sudo taskset -c 60 ./io_uring_nop_stress -r 300
>>> total ios: 1597672304
>>> IOPS: 5325574
>>> it's about 12% performance regression.
>>
>> For sure there's a bit more bloat in 5.7+ compared to the initial slim
>> version, and focus has been on features to a certain extent recently.
>> The poll rework for 5.7 will really improve performance for the
>> networked side though, so it's not like it's just piling on features
>> that add bloat.
>>
>> That said, I do think it's time for a revisit on overhead. It's been a
>> while since I've done my disk IO testing. The nop testing isn't _that_
>> interesting by itself, as a micro benchmark it does yield some results
>> though. Are you running on bare metal or in a VM?
> I run it on bare metal.
>
>>
>>> Using perf can see many performance bottlenecks, for example,
>>> io_submit_sqes is one. For now, I did't make many analysis yet, just
>>> have a look at io_submit_sqes(), there are many assignment operations
>>> in io_init_req(), but I'm not sure whether they are all needed when
>>> req is not needed to be punt to io-wq, for example,
>>> INIT_IO_WORK(&req->work, io_wq_submit_work); # a whole struct
>>> assignment from perf annotate tool, it's an expensive operation, I
>>> think reqs that use fast poll feature use task-work function, so the
>>> INIT_IO_WORK maybe not necessary.
>>
>> I'm sure there's some low hanging fruit there, and I'd love to take
>> patches for it.
> Ok, I'll try to improve it a bit, thanks.
>
>>
>>> Above is just one issue, what I worry is that whether io_uring is
>>> becoming more bloated gradually, and will not that better to aio. In
>>> https://kernel.dk/io_uring.pdf, it says that io_uring will eliminate
>>> 104 bytes copy compared to aio, but see currenct io_init_req(),
>>> io_uring maybe copy more, introducing more overhead? Or does we need
>>> to carefully re-design struct io_kiocb, to reduce overhead as soon as
>>> possible.
>>
>> The copy refers to the data structures coming in and out, both io_uring
>> and io_uring inititalize their main io_kiocb/aio_kiocb structure as
>> well. The io_uring is slightly bigger, but not much, and it's the same
>> number of cachelines. So should not be a huge difference there. The
>> copying on the aio side is basically first the pointer copy, then the
>> user side kiocb structure. io_uring doesn't need to do that. The
>> completion side is also slimmer. We also don't need as many system calls
>> to do the same thing, for example.
>>
>> So no, we should always been substantially slimmer than aio, just by the
>> very nature of the API.
> Really thanks for detailed explanations, now I see, feel good:)
>
>>
>> One major thing I've been thinking about for io_uring is io_kiocb
>> recycling. We're hitting the memory allocator for alloc+free for each
>> request, even though that can be somewhat amortized by doing batched
>> submissions, and polling for instance can also do batched frees. But I'm
>> pretty sure we can find some gains here by having some io_kiocb caching
>> that is persistent across operations.
> Yes, agree. From my above nop operation tests, using perf, it shows
> kmem_cache_free() and kmem_cache_alloc_bulk() introduce an considerable
> overhead.
Just did a quick'n dirty where we have a global per-cpu cache of
io_kiocbs, so we can recycle them.
I'm using my laptop for this testing, running in kvm. This isn't ideal,
as the spinlock/irq overhead is much larger there, but it's the same
across all runs.
Baseline performance, this is running your test app as-is, no changes:
# taskset -c 0 ./io_uring_nop_stress
total ios: 294158288
IOPS: 9805276
Then I disabled the bulk alloc optimization, so we don't get any
benefits of allocating 16 requests at the time:
# taskset -c 0 ./io_uring_nop_stress
total ios: 285341456
IOPS: 9511381
As expected, we're down a bit at that point, as we're allocating each
io_kiocb individually.
Then I tried with the quick'n dirty patch:
# taskset -c 0 ./io_uring_nop_stress
total ios: 325797040
IOPS: 10859901
and it's quite a bit faster. This is cheating a little bit, since these
are just nops and we don't have to disable local irqs while grabbing a
recycled entry. But just adding that and it should be safe to use, and
probably won't make much of a difference on actual hardware.
The main concern for me is how to manage this cache. We could just limit
it to X entries per cpu or something like that, but it'd be nice to be
able to respond to memory pressure.
Quick'n dirty below.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 4c2fe06ae20b..b6a6d0f0f4e5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -673,6 +673,13 @@ struct io_submit_state {
unsigned int ios_left;
};
+struct io_kiocb_cache {
+ struct list_head list;
+ unsigned nr_free;
+};
+
+static struct io_kiocb_cache *alloc_cache;
+
struct io_op_def {
/* needs req->io allocated for deferral/async */
unsigned async_ctx : 1;
@@ -1293,6 +1300,29 @@ static struct io_kiocb *io_get_fallback_req(struct io_ring_ctx *ctx)
return NULL;
}
+static struct io_kiocb *io_cache_alloc(void)
+{
+ struct io_kiocb_cache *cache;
+ struct io_kiocb *req = NULL;
+
+ cache = this_cpu_ptr(alloc_cache);
+ if (!list_empty(&cache->list)) {
+ req = list_first_entry(&cache->list, struct io_kiocb, list);
+ list_del(&req->list);
+ return req;
+ }
+
+ return NULL;
+}
+
+static void io_cache_free(struct io_kiocb *req)
+{
+ struct io_kiocb_cache *cache;
+
+ cache = this_cpu_ptr(alloc_cache);
+ list_add(&req->list, &cache->list);
+}
+
static struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx,
struct io_submit_state *state)
{
@@ -1300,6 +1330,9 @@ static struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx,
struct io_kiocb *req;
if (!state) {
+ req = io_cache_alloc();
+ if (req)
+ return req;
req = kmem_cache_alloc(req_cachep, gfp);
if (unlikely(!req))
goto fallback;
@@ -1372,7 +1405,7 @@ static void __io_free_req(struct io_kiocb *req)
percpu_ref_put(&req->ctx->refs);
if (likely(!io_is_fallback_req(req)))
- kmem_cache_free(req_cachep, req);
+ io_cache_free(req);
else
clear_bit_unlock(0, (unsigned long *) &req->ctx->fallback_req);
}
@@ -5842,7 +5875,7 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
struct file *ring_file, int ring_fd, bool async)
{
- struct io_submit_state state, *statep = NULL;
+ //struct io_submit_state state, *statep = NULL;
struct io_kiocb *link = NULL;
int i, submitted = 0;
@@ -5859,10 +5892,12 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
if (!percpu_ref_tryget_many(&ctx->refs, nr))
return -EAGAIN;
+#if 0
if (nr > IO_PLUG_THRESHOLD) {
io_submit_state_start(&state, nr);
statep = &state;
}
+#endif
ctx->ring_fd = ring_fd;
ctx->ring_file = ring_file;
@@ -5877,14 +5912,14 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
io_consume_sqe(ctx);
break;
}
- req = io_alloc_req(ctx, statep);
+ req = io_alloc_req(ctx, NULL);
if (unlikely(!req)) {
if (!submitted)
submitted = -EAGAIN;
break;
}
- err = io_init_req(ctx, req, sqe, statep, async);
+ err = io_init_req(ctx, req, sqe, NULL, async);
io_consume_sqe(ctx);
/* will complete beyond this point, count as submitted */
submitted++;
@@ -5898,7 +5933,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
trace_io_uring_submit_sqe(ctx, req->opcode, req->user_data,
true, async);
- err = io_submit_sqe(req, sqe, statep, &link);
+ err = io_submit_sqe(req, sqe, NULL, &link);
if (err)
goto fail_req;
}
@@ -5910,8 +5945,10 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
}
if (link)
io_queue_link_head(link);
+#if 0
if (statep)
io_submit_state_end(&state);
+#endif
/* Commit SQ ring head once we've consumed and submitted all SQEs */
io_commit_sqring(ctx);
@@ -8108,6 +8145,8 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
static int __init io_uring_init(void)
{
+ int cpu;
+
#define __BUILD_BUG_VERIFY_ELEMENT(stype, eoffset, etype, ename) do { \
BUILD_BUG_ON(offsetof(stype, ename) != eoffset); \
BUILD_BUG_ON(sizeof(etype) != sizeof_field(stype, ename)); \
@@ -8147,6 +8186,15 @@ static int __init io_uring_init(void)
BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST);
BUILD_BUG_ON(__REQ_F_LAST_BIT >= 8 * sizeof(int));
req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
+
+ alloc_cache = alloc_percpu(struct io_kiocb_cache);
+ for_each_possible_cpu(cpu) {
+ struct io_kiocb_cache *cache;
+
+ cache = per_cpu_ptr(alloc_cache, cpu);
+ INIT_LIST_HEAD(&cache->list);
+ cache->nr_free = 0;
+ }
return 0;
};
__initcall(io_uring_init);
--
Jens Axboe
prev parent reply other threads:[~2020-05-11 18:08 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-05-08 15:18 Is io_uring framework becoming bloated gradually? and introduces performace regression Xiaoguang Wang
2020-05-08 16:37 ` Jens Axboe
2020-05-10 18:10 ` H. de Vries
2020-05-11 5:43 ` Xiaoguang Wang
2020-05-11 3:30 ` Xiaoguang Wang
2020-05-11 18:08 ` Jens Axboe [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox