[RFC] io_uring: optimise requests referencing ctx

public inbox for [email protected]
 help / color / mirror / Atom feed

* [RFC] io_uring: optimise requests referencing ctx
@ 2021-10-02 12:15 Pavel Begunkov
  2021-10-02 13:09 ` Jens Axboe
  0 siblings, 1 reply; 2+ messages in thread
From: Pavel Begunkov @ 2021-10-02 12:15 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Currenlty, we allocate one ctx reference per request at submission time
and put them at free. It's batched and not so expensive but it still
bloats the kernel, adds 2 function calls for rcu and adds some overhead
for request counting in io_free_batch_list().

Always keep one reference with a request, even when it's freed and in
io_uring request caches. There is extra work at ring exit / quiesce
paths, which now need to put all cached requests. io_ring_exit_work() is
already looping, so it's not a problem. Add hybrid-busy waiting to
io_ctx_quiesce() as well for now.

Signed-off-by: Pavel Begunkov <[email protected]>
---

I want to get rid of extra request ctx referencing, but across different
kernel versions have been getting "interesting" results loosing
performance for nops test. Thus, it's only RFC to see whether I'm the
only one seeing weird effects.



 fs/io_uring.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 98401ec46c12..e1877b5ccf26 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1805,7 +1805,6 @@ static void io_req_complete_post(struct io_kiocb *req, long res,
 		io_put_task(req->task, 1);
 		wq_list_add_head(&req->comp_list, &ctx->locked_free_list);
 		ctx->locked_free_nr++;
-		percpu_ref_put(&ctx->refs);
 	}
 	io_commit_cqring(ctx);
 	spin_unlock(&ctx->completion_lock);
@@ -1933,6 +1932,7 @@ static struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx)
 		ret = 1;
 	}
 
+	percpu_ref_get_many(&ctx->refs, ret);
 	for (i = 0; i < ret; i++) {
 		req = reqs[i];
 
@@ -1977,8 +1977,6 @@ static void __io_free_req(struct io_kiocb *req)
 	wq_list_add_head(&req->comp_list, &ctx->locked_free_list);
 	ctx->locked_free_nr++;
 	spin_unlock(&ctx->completion_lock);
-
-	percpu_ref_put(&ctx->refs);
 }
 
 static inline void io_remove_next_linked(struct io_kiocb *req)
@@ -2267,7 +2265,7 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
 	__must_hold(&ctx->uring_lock)
 {
 	struct task_struct *task = NULL;
-	int task_refs = 0, ctx_refs = 0;
+	int task_refs = 0;
 
 	do {
 		struct io_kiocb *req = container_of(node, struct io_kiocb,
@@ -2287,12 +2285,9 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
 			task_refs = 0;
 		}
 		task_refs++;
-		ctx_refs++;
 		wq_stack_add_head(&req->comp_list, &ctx->submit_state.free_list);
 	} while (node);
 
-	if (ctx_refs)
-		percpu_ref_put_many(&ctx->refs, ctx_refs);
 	if (task)
 		io_put_task(task, task_refs);
 }
@@ -7202,8 +7197,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
 		return 0;
 	/* make sure SQ entry isn't read before tail */
 	nr = min3(nr, ctx->sq_entries, entries);
-	if (!percpu_ref_tryget_many(&ctx->refs, nr))
-		return -EAGAIN;
 	io_get_task_refs(nr);
 
 	io_submit_state_start(&ctx->submit_state, nr);
@@ -7233,7 +7226,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
 		int unused = nr - ref_used;
 
 		current->io_uring->cached_refs += unused;
-		percpu_ref_put_many(&ctx->refs, unused);
 	}
 
 	io_submit_state_end(ctx);
@@ -9154,6 +9146,7 @@ static void io_destroy_buffers(struct io_ring_ctx *ctx)
 static void io_req_caches_free(struct io_ring_ctx *ctx)
 {
 	struct io_submit_state *state = &ctx->submit_state;
+	int nr;
 
 	mutex_lock(&ctx->uring_lock);
 	io_flush_cached_locked_reqs(ctx, state);
@@ -9165,7 +9158,10 @@ static void io_req_caches_free(struct io_ring_ctx *ctx)
 		node = wq_stack_extract(&state->free_list);
 		req = container_of(node, struct io_kiocb, comp_list);
 		kmem_cache_free(req_cachep, req);
+		nr++;
 	}
+	if (nr)
+		percpu_ref_put_many(&ctx->refs, nr);
 	mutex_unlock(&ctx->uring_lock);
 }
 
@@ -9335,6 +9331,8 @@ static void io_ring_exit_work(struct work_struct *work)
 			io_sq_thread_unpark(sqd);
 		}
 
+		io_req_caches_free(ctx);
+
 		if (WARN_ON_ONCE(time_after(jiffies, timeout))) {
 			/* there is little hope left, don't run it too often */
 			interval = HZ * 60;
@@ -10714,10 +10712,14 @@ static int io_ctx_quiesce(struct io_ring_ctx *ctx)
 	 */
 	mutex_unlock(&ctx->uring_lock);
 	do {
-		ret = wait_for_completion_interruptible(&ctx->ref_comp);
-		if (!ret)
+		ret = wait_for_completion_interruptible_timeout(&ctx->ref_comp, HZ);
+		if (ret) {
+			ret = min(0L, ret);
 			break;
+		}
+
 		ret = io_run_task_work_sig();
+		io_req_caches_free(ctx);
 	} while (ret >= 0);
 	mutex_lock(&ctx->uring_lock);
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 2+ messages in thread

* Re: [RFC] io_uring: optimise requests referencing ctx
  2021-10-02 12:15 [RFC] io_uring: optimise requests referencing ctx Pavel Begunkov
@ 2021-10-02 13:09 ` Jens Axboe
  0 siblings, 0 replies; 2+ messages in thread
From: Jens Axboe @ 2021-10-02 13:09 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 10/2/21 6:15 AM, Pavel Begunkov wrote:
> Currenlty, we allocate one ctx reference per request at submission time
> and put them at free. It's batched and not so expensive but it still
> bloats the kernel, adds 2 function calls for rcu and adds some overhead
> for request counting in io_free_batch_list().
> 
> Always keep one reference with a request, even when it's freed and in
> io_uring request caches. There is extra work at ring exit / quiesce
> paths, which now need to put all cached requests. io_ring_exit_work() is
> already looping, so it's not a problem. Add hybrid-busy waiting to
> io_ctx_quiesce() as well for now.
> 
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
> 
> I want to get rid of extra request ctx referencing, but across different
> kernel versions have been getting "interesting" results loosing
> performance for nops test. Thus, it's only RFC to see whether I'm the
> only one seeing weird effects.

I ran this through the usual peak per-core testing:

Setup 1: 3970X, this one ends up being core limited
Setup 2: 5950X, this one ends up being device limited

Peak-1-threads is:

taskset -c 16 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n1 /dev/nvme1n1

Peak-2-threads is:

taskset -c 0,16 t/io_uring -b512 -d128 -s32 -c32 -p1 -F1 -B1 -n2 /dev/nvme2n1 /dev/nvme1n1

where 0/16 are thread siblings.

NOPS is:

taskset -c 16 t/io_uring -b512 -d128 -s32 -c32 -N1

Results are in IOPS, and peak-2-threads is only run on the faster box.

Setup/Test   |  Peak-1-thread   Peak-2-threads   NOPS   Diff
------------------------------------------------------------------
Setup 1 pre  |      3.81M            N/A         47.0M
Setup 1 post |      3.84M            N/A         47.6M  +0.8-1.2%
Setup 2 pre  |      5.11M            5.70M       70.3M
Setup 2 post |      5.17M            5.75M       73.1M  +1.2-4.0%

Looks like a nice win to me, on both setups.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2021-10-02 13:09 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-10-02 12:15 [RFC] io_uring: optimise requests referencing ctx Pavel Begunkov
2021-10-02 13:09 ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox