* [RFC] io_uring: wake up optimisations
From: Pavel Begunkov @ 2022-12-20 17:58 UTC
To: io-uring; +Cc: Jens Axboe, asml.silence
NOT FOR INCLUSION, needs some ring poll workarounds
Completion flushing is done either from the submit syscall or by
task_work, both of which run in the context of the submitter task. For
single threaded rings, as implied by ->task_complete, there can be no
waiters on ->cq_wait other than the master task. That means no task
can be sleeping on cq_wait while we run __io_submit_flush_completions(),
so waking up can be skipped.
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
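For reference, the waiter side that makes this safe looks roughly like
below (paraphrased, not the exact io_cqring_wait() code):

/*
 * Paraphrased shape of the waiter loop in io_cqring_wait(): the
 * submitter re-checks the wake condition every time it comes back
 * from running task_work, so its own flush can't lose a wakeup.
 */
do {
	prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq,
				  TASK_INTERRUPTIBLE);
	/* runs pending task_work (which may flush completions) or
	 * sleeps; re-checks io_should_wake() either way */
	ret = io_cqring_wait_schedule(ctx, &iowq, timeout);
	cond_resched();
} while (ret > 0);
finish_wait(&ctx->cq_wait, &iowq.wq);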
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 16a323a9ff70..a57b9008807c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -618,6 +618,25 @@ static inline void __io_cq_unlock_post(struct io_ring_ctx *ctx)
 	io_cqring_wake(ctx);
 }
 
+static inline void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
+	__releases(ctx->completion_lock)
+{
+	io_commit_cqring(ctx);
+	__io_cq_unlock(ctx);
+	io_commit_cqring_flush(ctx);
+
+	/*
+	 * As ->task_complete implies that the ring is single tasked, cq_wait
+	 * may only be waited on by the current task in io_cqring_wait(), but since
+	 * it will re-check the wakeup conditions once we return we can safely
+	 * skip waking it up.
+	 */
+	if (!ctx->task_complete) {
+		smp_mb();
+		__io_cqring_wake(ctx);
+	}
+}
+
 void io_cq_unlock_post(struct io_ring_ctx *ctx)
 	__releases(ctx->completion_lock)
 {
@@ -1458,7 +1477,7 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 			}
 		}
 	}
-	__io_cq_unlock_post(ctx);
+	__io_cq_unlock_post_flush(ctx);
 
 	if (!wq_list_empty(&ctx->submit_state.compl_reqs)) {
 		io_free_batch_list(ctx, state->compl_reqs.first);
--
2.38.1
* Re: [RFC] io_uring: wake up optimisations
From: Pavel Begunkov @ 2022-12-20 18:06 UTC
To: io-uring; +Cc: Jens Axboe
On 12/20/22 17:58, Pavel Begunkov wrote:
> NOT FOR INCLUSION, needs some ring poll workarounds
>
> Completion flushing is done either from the submit syscall or by
> task_work, both of which run in the context of the submitter task. For
> single threaded rings, as implied by ->task_complete, there can be no
> waiters on ->cq_wait other than the master task. That means no task
> can be sleeping on cq_wait while we run __io_submit_flush_completions(),
> so waking up can be skipped.
It's not trivial to benchmark, as we need something to emulate task_work
arriving in the middle of waiting. I used the diff below to complete nops
in tw and removed the preliminary tw runs for the "in the middle of
waiting" part. IORING_SETUP_SKIP_CQWAKE controls whether we use the
optimisation or not.

It gets around 15% more IOPS (6769526 -> 7803304), which correlates
with the ~10% wakeup cost seen in profiles. Another interesting part is
that waitqueues are excessive for our purposes and we could replace
cq_wait with something lighter, e.g. an atomic bit set.
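For reference, the userspace side of such a nop benchmark can be
sketched with liburing as below. It's only a sketch and not the actual
tool used for the numbers above; IORING_SETUP_SKIP_CQWAKE exists only
in the diff that follows, hence the local define.

/* Sketch of a nop benchmark driver. Not the actual tool used for the
 * numbers above; IORING_SETUP_SKIP_CQWAKE comes from the diff below.
 */
#include <liburing.h>

#ifndef IORING_SETUP_SKIP_CQWAKE
#define IORING_SETUP_SKIP_CQWAKE	(1U << 14)
#endif

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	unsigned long done = 0;
	int i, ret;

	ret = io_uring_queue_init(64, &ring, IORING_SETUP_SINGLE_ISSUER |
				  IORING_SETUP_DEFER_TASKRUN |
				  IORING_SETUP_SKIP_CQWAKE);
	if (ret)
		return 1;

	while (done < 10000000UL) {
		for (i = 0; i < 32; i++)
			io_uring_prep_nop(io_uring_get_sqe(&ring));
		/* nops complete via task_work here, so wait for the batch */
		ret = io_uring_submit_and_wait(&ring, 32);
		if (ret < 0)
			break;
		while (!io_uring_peek_cqe(&ring, &cqe)) {
			io_uring_cqe_seen(&ring, cqe);
			done++;
		}
	}
	io_uring_queue_exit(&ring);
	return 0;
}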
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 9d4c4078e8d0..5a4f03a4ea40 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -164,6 +164,7 @@ enum {
  * try to do it just before it is needed.
  */
 #define IORING_SETUP_DEFER_TASKRUN	(1U << 13)
+#define IORING_SETUP_SKIP_CQWAKE	(1U << 14)
 
 enum io_uring_op {
 	IORING_OP_NOP,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index a57b9008807c..68556dea060b 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -631,7 +631,7 @@ static inline void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
 	 * it will re-check the wakeup conditions once we return we can safely
 	 * skip waking it up.
 	 */
-	if (!ctx->task_complete) {
+	if (!(ctx->flags & IORING_SETUP_SKIP_CQWAKE)) {
 		smp_mb();
 		__io_cqring_wake(ctx);
 	}
@@ -2519,18 +2519,6 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	if (!io_allowed_run_tw(ctx))
 		return -EEXIST;
 
-	do {
-		/* always run at least 1 task work to process local work */
-		ret = io_run_task_work_ctx(ctx);
-		if (ret < 0)
-			return ret;
-		io_cqring_overflow_flush(ctx);
-
-		/* if user messes with these they will just get an early return */
-		if (__io_cqring_events_user(ctx) >= min_events)
-			return 0;
-	} while (ret > 0);
-
 	if (sig) {
 #ifdef CONFIG_COMPAT
 		if (in_compat_syscall())
@@ -3345,16 +3333,6 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 			mutex_unlock(&ctx->uring_lock);
 			goto out;
 		}
-		if (flags & IORING_ENTER_GETEVENTS) {
-			if (ctx->syscall_iopoll)
-				goto iopoll_locked;
-			/*
-			 * Ignore errors, we'll soon call io_cqring_wait() and
-			 * it should handle ownership problems if any.
-			 */
-			if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
-				(void)io_run_local_work_locked(ctx);
-		}
 		mutex_unlock(&ctx->uring_lock);
 	}
@@ -3721,7 +3699,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			IORING_SETUP_R_DISABLED | IORING_SETUP_SUBMIT_ALL |
 			IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG |
 			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
-			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN))
+			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |
+			IORING_SETUP_SKIP_CQWAKE))
 		return -EINVAL;
 
 	return io_uring_create(entries, &p, params);
diff --git a/io_uring/nop.c b/io_uring/nop.c
index d956599a3c1b..77c686de3eb2 100644
--- a/io_uring/nop.c
+++ b/io_uring/nop.c
@@ -20,6 +20,6 @@ int io_nop_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
  */
 int io_nop(struct io_kiocb *req, unsigned int issue_flags)
 {
-	io_req_set_res(req, 0, 0);
-	return IOU_OK;
+	io_req_queue_tw_complete(req, 0);
+	return IOU_ISSUE_SKIP_COMPLETE;
 }
--
Pavel Begunkov
* Re: [RFC] io_uring: wake up optimisations
From: Jens Axboe @ 2022-12-20 18:10 UTC
To: Pavel Begunkov, io-uring
On 12/20/22 11:06 AM, Pavel Begunkov wrote:
> On 12/20/22 17:58, Pavel Begunkov wrote:
>> NOT FOR INCLUSION, needs some ring poll workarounds
>>
>> Completion flushing is done either from the submit syscall or by
>> task_work, both of which run in the context of the submitter task. For
>> single threaded rings, as implied by ->task_complete, there can be no
>> waiters on ->cq_wait other than the master task. That means no task
>> can be sleeping on cq_wait while we run __io_submit_flush_completions(),
>> so waking up can be skipped.
>
> It's not trivial to benchmark, as we need something to emulate task_work
> arriving in the middle of waiting. I used the diff below to complete nops
> in tw and removed the preliminary tw runs for the "in the middle of
> waiting" part. IORING_SETUP_SKIP_CQWAKE controls whether we use the
> optimisation or not.
>
> It gets around 15% more IOPS (6769526 -> 7803304), which correlates
> with the ~10% wakeup cost seen in profiles. Another interesting part is
> that waitqueues are excessive for our purposes and we could replace
> cq_wait with something lighter, e.g. an atomic bit set.
I was thinking something like that the other day; for most purposes
the wait infra is too heavy-handed for our case. If we exclude poll
for a second, everything else is internal and e.g. doesn't need
IRQ-safe locking at all. That's just one part of it. But I didn't have
a good idea for the poll() side of things, which would be required
to make some progress there.
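For illustration, even with nobody to wake, a plain wake_up() pays for
the IRQ-safe waitqueue lock. Abridged shape of __wake_up() from
kernel/sched/wait.c:

/* abridged: the lock is taken IRQ-safe before any waiter is found */
void __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
	       int nr_exclusive, void *key)
{
	unsigned long flags;

	spin_lock_irqsave(&wq_head->lock, flags);
	/* ... walk the wait list and call the wake functions ... */
	spin_unlock_irqrestore(&wq_head->lock, flags);
}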
--
Jens Axboe
* Re: [RFC] io_uring: wake up optimisations
From: Pavel Begunkov @ 2022-12-20 19:12 UTC
To: Jens Axboe, io-uring
On 12/20/22 18:10, Jens Axboe wrote:
> On 12/20/22 11:06 AM, Pavel Begunkov wrote:
>> On 12/20/22 17:58, Pavel Begunkov wrote:
>>> NOT FOR INCLUSION, needs some ring poll workarounds
>>>
>>> Completion flushing is done either from the submit syscall or by
>>> task_work, both of which run in the context of the submitter task. For
>>> single threaded rings, as implied by ->task_complete, there can be no
>>> waiters on ->cq_wait other than the master task. That means no task
>>> can be sleeping on cq_wait while we run __io_submit_flush_completions(),
>>> so waking up can be skipped.
>>
>> It's not trivial to benchmark, as we need something to emulate task_work
>> arriving in the middle of waiting. I used the diff below to complete nops
>> in tw and removed the preliminary tw runs for the "in the middle of
>> waiting" part. IORING_SETUP_SKIP_CQWAKE controls whether we use the
>> optimisation or not.
>>
>> It gets around 15% more IOPS (6769526 -> 7803304), which correlates
>> with the ~10% wakeup cost seen in profiles. Another interesting part is
>> that waitqueues are excessive for our purposes and we could replace
>> cq_wait with something lighter, e.g. an atomic bit set.
>
> I was thinking something like that the other day; for most purposes
> the wait infra is too heavy-handed for our case. If we exclude poll
> for a second, everything else is internal and e.g. doesn't need
> IRQ-safe locking at all. That's just one part of it. But I didn't have
Ring polling? We can move it to a separate waitqueue, probably with
some tricks to remove extra ifs from the hot path, which I'm
planning to add in v2.
> a good idea for the poll() side of things, which would be required
> to make some progress there.
I'll play with replacing the waitqueue with bitops; that should save an
extra ~5% with the benchmark I used.
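Roughly what I have in mind, as a sketch with hypothetical names (there
is no ctx->cq_wait_state today; ctx->submitter_task is the
SINGLE_ISSUER task):

/* hypothetical flag bit advertising a sleeping waiter */
#define IO_CTX_CQ_WAITING	0

/* waiter side, instead of prepare_to_wait() on ->cq_wait */
set_bit(IO_CTX_CQ_WAITING, &ctx->cq_wait_state);
set_current_state(TASK_INTERRUPTIBLE);
/* make the bit visible before re-checking for completions */
smp_mb__after_atomic();
if (!io_should_wake(iowq))
	schedule();
__set_current_state(TASK_RUNNING);
clear_bit(IO_CTX_CQ_WAITING, &ctx->cq_wait_state);

/* completion side, instead of the waitqueue wakeup */
smp_mb(); /* pairs with the barrier on the waiter side */
if (test_bit(IO_CTX_CQ_WAITING, &ctx->cq_wait_state))
	wake_up_process(ctx->submitter_task);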
--
Pavel Begunkov
* Re: [RFC] io_uring: wake up optimisations
From: Jens Axboe @ 2022-12-20 19:22 UTC
To: Pavel Begunkov, io-uring
On 12/20/22 12:12 PM, Pavel Begunkov wrote:
> On 12/20/22 18:10, Jens Axboe wrote:
> On 12/20/22 11:06 AM, Pavel Begunkov wrote:
>>> On 12/20/22 17:58, Pavel Begunkov wrote:
>>>> NOT FOR INCLUSION, needs some ring poll workarounds
>>>>
>>>> Completion flushing is done either from the submit syscall or by
>>>> task_work, both of which run in the context of the submitter task. For
>>>> single threaded rings, as implied by ->task_complete, there can be no
>>>> waiters on ->cq_wait other than the master task. That means no task
>>>> can be sleeping on cq_wait while we run __io_submit_flush_completions(),
>>>> so waking up can be skipped.
>>>
>>> It's not trivial to benchmark, as we need something to emulate task_work
>>> arriving in the middle of waiting. I used the diff below to complete nops
>>> in tw and removed the preliminary tw runs for the "in the middle of
>>> waiting" part. IORING_SETUP_SKIP_CQWAKE controls whether we use the
>>> optimisation or not.
>>>
>>> It gets around 15% more IOPS (6769526 -> 7803304), which correlates
>>> with the ~10% wakeup cost seen in profiles. Another interesting part is
>>> that waitqueues are excessive for our purposes and we could replace
>>> cq_wait with something lighter, e.g. an atomic bit set.
>>
>> I was thinking something like that the other day; for most purposes
>> the wait infra is too heavy-handed for our case. If we exclude poll
>> for a second, everything else is internal and e.g. doesn't need
>> IRQ-safe locking at all. That's just one part of it. But I didn't have
>
> Ring polling? We can move it to a separate waitqueue, probably with
> some tricks to remove extra ifs from the hot path, which I'm
> planning to add in v2.
Yes, polling on the ring itself. And that was my thinking too: leave
cq_wait just for that, then hide it behind <something something> to
make it hopefully almost free when the ring isn't polled. I just
hadn't put any thought into what exactly that'd look like yet.
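E.g. something along these lines, as a sketch with made-up names
(poll_wq and poll_activated don't exist yet):

/* give ring fd polling its own waitqueue, kept off the CQ hot path */
static __poll_t io_uring_poll(struct file *file, poll_table *wait)
{
	struct io_ring_ctx *ctx = file->private_data;
	__poll_t mask = 0;

	/* remember that this ring has been polled at least once */
	if (unlikely(!READ_ONCE(ctx->poll_activated)))
		WRITE_ONCE(ctx->poll_activated, true);
	poll_wait(file, &ctx->poll_wq, wait);
	/* ... fill in mask from SQ/CQ state as before ... */
	return mask;
}

/* hot path: nearly free when the ring was never polled */
static inline void io_poll_wq_wake(struct io_ring_ctx *ctx)
{
	if (unlikely(READ_ONCE(ctx->poll_activated)))
		wake_up_all(&ctx->poll_wq);
}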
>> a good idea for the poll() side of things, which would be required
>> to make some progress there.
>
> I'll play with replacing the waitqueue with bitops; that should save an
> extra ~5% with the benchmark I used.
Excellent, looking forward to seeing that.
--
Jens Axboe