public inbox for [email protected]
* [PATCH 0/4] clean up deferred tw wakeups
@ 2024-01-17  0:57 Pavel Begunkov
  2024-01-17  0:57 ` [PATCH 1/4] io_uring: adjust defer tw counting Pavel Begunkov
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Pavel Begunkov @ 2024-01-17  0:57 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

While reviewing io_req_local_work_add() I haven't found any real problems,
but there are definitely rough edges. Remove one extra smp_mb__after_atomic(),
add comments about how the synchronisation works, and improve mixing lazy with
non-lazy work items.

Pavel Begunkov (4):
  io_uring: adjust defer tw counting
  io_uring: clean up local tw add-wait sync
  io_uring: clean *local_work_add var naming
  io_uring: combine cq_wait_nr checks

 io_uring/io_uring.c | 58 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 42 insertions(+), 16 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/4] io_uring: adjust defer tw counting
  2024-01-17  0:57 [PATCH 0/4] clean up deferred tw wakeups Pavel Begunkov
@ 2024-01-17  0:57 ` Pavel Begunkov
  2024-01-17  0:57 ` [PATCH 2/4] io_uring: clean up local tw add-wait sync Pavel Begunkov
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Pavel Begunkov @ 2024-01-17  0:57 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

The UINT_MAX work item counting bias in io_req_local_work_add() in the
case of !IOU_F_TWQ_LAZY_WAKE works in the sense that we will not miss a
wakeup, but it's still fragile. In particular, if we add a lazy work
item after a non-lazy one, the increment wraps nr_tw to 0, and
subsequent adds may then try to wake up the task unnecessarily, though
that is unlikely to happen in real workloads.

Halve the bias. It is still larger than any valid ->cq_wait_nr, which
is bounded by IORING_MAX_CQ_ENTRIES, but it now also leaves enough
headroom before it overflows.

Fixes: 8751d15426a31 ("io_uring: reduce scheduling due to tw")
Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 50c9f04bc193..d40c767a6216 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1325,7 +1325,7 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 		nr_tw = nr_tw_prev + 1;
 		/* Large enough to fail the nr_wait comparison below */
 		if (!(flags & IOU_F_TWQ_LAZY_WAKE))
-			nr_tw = -1U;
+			nr_tw = INT_MAX;
 
 		req->nr_tw = nr_tw;
 		req->io_task_work.node.next = first;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/4] io_uring: clean up local tw add-wait sync
  2024-01-17  0:57 [PATCH 0/4] clean up deferred tw wakeups Pavel Begunkov
  2024-01-17  0:57 ` [PATCH 1/4] io_uring: adjust defer tw counting Pavel Begunkov
@ 2024-01-17  0:57 ` Pavel Begunkov
  2024-01-17  0:57 ` [PATCH 3/4] io_uring: clean *local_work_add var naming Pavel Begunkov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Pavel Begunkov @ 2024-01-17  0:57 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Kill the smp_mb__after_atomic() right before wake_up(); it's useless.
Add a comment explaining the implicit barriers from cmpxchg and the
synchronisation around ->cq_wait_nr with the waiter.
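
The pairing being documented is the classic store-buffering pattern. A
userspace model with C11 atomics may help illustrate it (a sketch with
made-up names, not the kernel code):

	#include <stdatomic.h>

	atomic_int work_pending;	/* models ctx->work_llist */
	atomic_int cq_wait_nr;		/* models ctx->cq_wait_nr */

	void adder(void)
	{
		/* seq_cst RMW plays the role of the cmpxchg full barrier */
		atomic_exchange(&work_pending, 1);
		if (atomic_load(&cq_wait_nr))
			; /* wake the waiter */
	}

	int waiter_prepare(int n)
	{
		atomic_store(&cq_wait_nr, n);
		/* models the smp_mb() inside set_current_state() */
		atomic_thread_fence(memory_order_seq_cst);
		return atomic_load(&work_pending); /* sleep only if 0 */
	}

With full barriers on both sides, at least one of the two loads sees
the other side's store: either the adder observes the waiter's
->cq_wait_nr and wakes it up, or the waiter observes the queued work
and does not go to sleep.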

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index d40c767a6216..3ab7e6a46149 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1332,6 +1332,14 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 	} while (!try_cmpxchg(&ctx->work_llist.first, &first,
 			      &req->io_task_work.node));
 
+	/*
+	 * cmpxchg implies a full barrier, which pairs with the barrier
+	 * in set_current_state() on the io_cqring_wait() side. It's used
+	 * to ensure that either we see updated ->cq_wait_nr, or waiters
+	 * going to sleep will observe the work added to the list, which
+	 * is similar to the wait/wake task state sync.
+	 */
+
 	if (!first) {
 		if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
 			atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
@@ -1346,8 +1354,6 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 	/* either not enough or the previous add has already woken it up */
 	if (nr_wait > nr_tw || nr_tw_prev >= nr_wait)
 		return;
-	/* pairs with set_current_state() in io_cqring_wait() */
-	smp_mb__after_atomic();
 	wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 3/4] io_uring: clean *local_work_add var naming
  2024-01-17  0:57 [PATCH 0/4] clean up deferred tw wakeups Pavel Begunkov
  2024-01-17  0:57 ` [PATCH 1/4] io_uring: adjust defer tw counting Pavel Begunkov
  2024-01-17  0:57 ` [PATCH 2/4] io_uring: clean up local tw add-wait sync Pavel Begunkov
@ 2024-01-17  0:57 ` Pavel Begunkov
  2024-01-17  0:57 ` [PATCH 4/4] io_uring: combine cq_wait_nr checks Pavel Begunkov
  2024-01-17 14:59 ` [PATCH 0/4] clean up deferred tw wakeups Jens Axboe
  4 siblings, 0 replies; 6+ messages in thread
From: Pavel Begunkov @ 2024-01-17  0:57 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

if (!first) { ... }

It reads as "do something if this is not the first entry", but it does
exactly the opposite, because "first" here is a pointer to the first
entry, so !first means the list was empty. Remove the confusion by
renaming it to "head".

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3ab7e6a46149..3508198d17ba 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1304,16 +1304,16 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	unsigned nr_wait, nr_tw, nr_tw_prev;
-	struct llist_node *first;
+	struct llist_node *head;
 
 	if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK))
 		flags &= ~IOU_F_TWQ_LAZY_WAKE;
 
-	first = READ_ONCE(ctx->work_llist.first);
+	head = READ_ONCE(ctx->work_llist.first);
 	do {
 		nr_tw_prev = 0;
-		if (first) {
-			struct io_kiocb *first_req = container_of(first,
+		if (head) {
+			struct io_kiocb *first_req = container_of(head,
 							struct io_kiocb,
 							io_task_work.node);
 			/*
@@ -1328,8 +1328,8 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 			nr_tw = INT_MAX;
 
 		req->nr_tw = nr_tw;
-		req->io_task_work.node.next = first;
-	} while (!try_cmpxchg(&ctx->work_llist.first, &first,
+		req->io_task_work.node.next = head;
+	} while (!try_cmpxchg(&ctx->work_llist.first, &head,
 			      &req->io_task_work.node));
 
 	/*
@@ -1340,7 +1340,7 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 	 * is similar to the wait/wake task state sync.
 	 */
 
-	if (!first) {
+	if (!head) {
 		if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
 			atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
 		if (ctx->has_evfd)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 4/4] io_uring: combine cq_wait_nr checks
  2024-01-17  0:57 [PATCH 0/4] clean up deferred tw wakeups Pavel Begunkov
                   ` (2 preceding siblings ...)
  2024-01-17  0:57 ` [PATCH 3/4] io_uring: clean *local_work_add var naming Pavel Begunkov
@ 2024-01-17  0:57 ` Pavel Begunkov
  2024-01-17 14:59 ` [PATCH 0/4] clean up deferred tw wakeups Jens Axboe
  4 siblings, 0 replies; 6+ messages in thread
From: Pavel Begunkov @ 2024-01-17  0:57 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Instead of explicitly checking ->cq_wait_nr for whether there are
waiters, which is currently represented by 0, we can store a large
value there so that the nr_tw comparison automatically filters out
those cases. Add named constants for that and for the wakeup bias
value.
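
In effect the "no waiters" and "not enough" cases collapse into a
single comparison. A sketch of the resulting check with illustrative
values (assumptions spelled out, not copied from the patch):

	unsigned nr_wait = -1U;		/* IO_CQ_WAKE_INIT: no waiters */
	unsigned nr_tw = -1U >> 1;	/* IO_CQ_WAKE_FORCE */

	if (nr_tw < nr_wait)		/* true, so the wakeup is skipped */
		return;

Once a waiter stores a real count, which is bounded by
IORING_MAX_CQ_ENTRIES, any forced add passes the check because
IO_CQ_WAKE_FORCE is larger than any valid count, as enforced by the
BUILD_BUG_ON in the patch below.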

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3508198d17ba..b5fa3c7df1cf 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -137,6 +137,14 @@ struct io_defer_entry {
 #define IO_DISARM_MASK (REQ_F_ARM_LTIMEOUT | REQ_F_LINK_TIMEOUT | REQ_F_FAIL)
 #define IO_REQ_LINK_FLAGS (REQ_F_LINK | REQ_F_HARDLINK)
 
+/*
+ * No waiters. It's larger than any valid value of the tw counter
+ * so that tests against ->cq_wait_nr would fail and skip wake_up().
+ */
+#define IO_CQ_WAKE_INIT		(-1U)
+/* Forced wake up if there is a waiter regardless of ->cq_wait_nr */
+#define IO_CQ_WAKE_FORCE	(IO_CQ_WAKE_INIT >> 1)
+
 static bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 					 struct task_struct *task,
 					 bool cancel_all);
@@ -303,6 +311,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 		goto err;
 
 	ctx->flags = p->flags;
+	atomic_set(&ctx->cq_wait_nr, IO_CQ_WAKE_INIT);
 	init_waitqueue_head(&ctx->sqo_sq_wait);
 	INIT_LIST_HEAD(&ctx->sqd_list);
 	INIT_LIST_HEAD(&ctx->cq_overflow_list);
@@ -1306,6 +1315,13 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 	unsigned nr_wait, nr_tw, nr_tw_prev;
 	struct llist_node *head;
 
+	/* See comment above IO_CQ_WAKE_INIT */
+	BUILD_BUG_ON(IO_CQ_WAKE_FORCE <= IORING_MAX_CQ_ENTRIES);
+
+	/*
+	 * We don't know how many requests are in the link and whether
+	 * they can even be queued lazily, so fall back to non-lazy.
+	 */
 	if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK))
 		flags &= ~IOU_F_TWQ_LAZY_WAKE;
 
@@ -1322,10 +1338,14 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 			 */
 			nr_tw_prev = READ_ONCE(first_req->nr_tw);
 		}
+
+		/*
+		 * Theoretically, it can overflow, but that's fine as one of
+		 * Theoretically, it can overflow, but that's fine as one of
+		 * the previous adds should've tried to wake the task.
 		nr_tw = nr_tw_prev + 1;
-		/* Large enough to fail the nr_wait comparison below */
 		if (!(flags & IOU_F_TWQ_LAZY_WAKE))
-			nr_tw = INT_MAX;
+			nr_tw = IO_CQ_WAKE_FORCE;
 
 		req->nr_tw = nr_tw;
 		req->io_task_work.node.next = head;
@@ -1348,11 +1368,11 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 	}
 
 	nr_wait = atomic_read(&ctx->cq_wait_nr);
-	/* no one is waiting */
-	if (!nr_wait)
+	/* not enough or no one is waiting */
+	if (nr_tw < nr_wait)
 		return;
-	/* either not enough or the previous add has already woken it up */
-	if (nr_wait > nr_tw || nr_tw_prev >= nr_wait)
+	/* the previous add has already woken it up */
+	if (nr_tw_prev >= nr_wait)
 		return;
 	wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
 }
@@ -2620,7 +2640,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 
 		ret = io_cqring_wait_schedule(ctx, &iowq);
 		__set_current_state(TASK_RUNNING);
-		atomic_set(&ctx->cq_wait_nr, 0);
+		atomic_set(&ctx->cq_wait_nr, IO_CQ_WAKE_INIT);
 
 		/*
 		 * Run task_work after scheduling and before io_should_wake().
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 0/4] clean up deferred tw wakeups
  2024-01-17  0:57 [PATCH 0/4] clean up deferred tw wakeups Pavel Begunkov
                   ` (3 preceding siblings ...)
  2024-01-17  0:57 ` [PATCH 4/4] io_uring: combine cq_wait_nr checks Pavel Begunkov
@ 2024-01-17 14:59 ` Jens Axboe
  4 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2024-01-17 14:59 UTC (permalink / raw)
  To: io-uring, Pavel Begunkov


On Wed, 17 Jan 2024 00:57:25 +0000, Pavel Begunkov wrote:
> While reviewing io_req_local_work_add() I haven't found any real problems,
> but there are definitely rough edges. Remove one extra smp_mb__after_atomic(),
> add comments about how the synchronisation works, and improve mixing lazy with
> non-lazy work items.
> 
> Pavel Begunkov (4):
>   io_uring: adjust defer tw counting
>   io_uring: clean up local tw add-wait sync
>   io_uring: clean *local_work_add var naming
>   io_uring: combine cq_wait_nr checks
> 
> [...]

Applied, thanks!

[1/4] io_uring: adjust defer tw counting
      commit: c5121e33d343f1febaf2098d413a5ddfcaf5998f
[2/4] io_uring: clean up local tw add-wait sync
      commit: 71258398c2b829457859bc778887bf848db35f50
[3/4] io_uring: clean *local_work_add var naming
      commit: b67eaa9683e2a8fea427a4be2603c0f79416407d
[4/4] io_uring: combine cq_wait_nr checks
      commit: 06660b62dbaac1e5170f73b866d13081748d38d1

Best regards,
-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-01-17 14:59 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
2024-01-17  0:57 [PATCH 0/4] clean up deferred tw wakeups Pavel Begunkov
2024-01-17  0:57 ` [PATCH 1/4] io_uring: adjust defer tw counting Pavel Begunkov
2024-01-17  0:57 ` [PATCH 2/4] io_uring: clean up local tw add-wait sync Pavel Begunkov
2024-01-17  0:57 ` [PATCH 3/4] io_uring: clean *local_work_add var naming Pavel Begunkov
2024-01-17  0:57 ` [PATCH 4/4] io_uring: combine cq_wait_nr checks Pavel Begunkov
2024-01-17 14:59 ` [PATCH 0/4] clean up deferred tw wakeups Jens Axboe
