public inbox for [email protected]
 help / color / mirror / Atom feed
* [PATCHSET 0/4] Use io_wq_work_list for task_work
@ 2024-03-26 18:42 Jens Axboe
  2024-03-26 18:42 ` [PATCH 1/4] io_uring: use the right type for work_llist empty check Jens Axboe
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Jens Axboe @ 2024-03-26 18:42 UTC (permalink / raw)
  To: io-uring

Hi,

This converts the deferred, normal, and fallback task_work to use a
normal io_wq_work_list, rather than an llist.

The main motivation behind this is to get rid of the need to reverse
the list once it's deleted and run. I tested this basic conversion of
just switching it from an llist to an io_wq_work_list with a spinlock,
and I don't see any benefits from the lockless list. And for cases where
we get a bursty addition of task_work, this approach is faster as it
avoids the need to iterate the list upfront while reversing it.

And this is less code and simpler, so I'd prefer to go that route.

 include/linux/io_uring_types.h |  13 +--
 io_uring/io_uring.c            | 175 ++++++++++++++++-----------------
 io_uring/io_uring.h            |   8 +-
 io_uring/sqpoll.c              |   8 +-
 io_uring/tctx.c                |   3 +-
 5 files changed, 102 insertions(+), 105 deletions(-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/4] io_uring: use the right type for work_llist empty check
  2024-03-26 18:42 [PATCHSET 0/4] Use io_wq_work_list for task_work Jens Axboe
@ 2024-03-26 18:42 ` Jens Axboe
  2024-03-26 18:42 ` [PATCH 2/4] io_uring: switch deferred task_work to an io_wq_work_list Jens Axboe
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2024-03-26 18:42 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

io_task_work_pending() uses wq_list_empty() on ctx->work_llist, but it's
not an io_wq_work_list, it's a struct llist_head. They both have
->first as head-of-list, and it turns out the checks are identical. But
be proper and use the right helper.

Fixes: dac6a0eae793 ("io_uring: ensure iopoll runs local task work as well")
Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/io_uring.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index caf1f573bb87..27d039ddb05e 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -343,7 +343,7 @@ static inline int io_run_task_work(void)
 
 static inline bool io_task_work_pending(struct io_ring_ctx *ctx)
 {
-	return task_work_pending(current) || !wq_list_empty(&ctx->work_llist);
+	return task_work_pending(current) || !llist_empty(&ctx->work_llist);
 }
 
 static inline void io_tw_lock(struct io_ring_ctx *ctx, struct io_tw_state *ts)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/4] io_uring: switch deferred task_work to an io_wq_work_list
  2024-03-26 18:42 [PATCHSET 0/4] Use io_wq_work_list for task_work Jens Axboe
  2024-03-26 18:42 ` [PATCH 1/4] io_uring: use the right type for work_llist empty check Jens Axboe
@ 2024-03-26 18:42 ` Jens Axboe
  2024-03-27 13:24   ` Pavel Begunkov
  2024-03-26 18:42 ` [PATCH 3/4] io_uring: switch fallback work to io_wq_work_list Jens Axboe
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2024-03-26 18:42 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Lockless lists may be handy for some things, but they mean that items
are in the reverse order as we can only add to the head of the list.
That in turn means that iterating items on the list needs to reverse it
first, if it's sensitive to ordering between items on the list.

Switch the DEFER_TASKRUN work list from an llist to a normal
io_wq_work_list, and protect it with a lock. Then we can get rid of the
manual reversing of the list when running it, which takes considerable
cycles particularly for bursty task_work additions.

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/io_uring_types.h |  11 ++--
 io_uring/io_uring.c            | 117 ++++++++++++---------------------
 io_uring/io_uring.h            |   4 +-
 3 files changed, 51 insertions(+), 81 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index aeb4639785b5..e51bf15196e4 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -329,7 +329,9 @@ struct io_ring_ctx {
 	 * regularly bounce b/w CPUs.
 	 */
 	struct {
-		struct llist_head	work_llist;
+		struct io_wq_work_list	work_list;
+		spinlock_t		work_lock;
+		int			work_items;
 		unsigned long		check_cq;
 		atomic_t		cq_wait_nr;
 		atomic_t		cq_timeouts;
@@ -559,7 +561,10 @@ enum {
 typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
 
 struct io_task_work {
-	struct llist_node		node;
+	union {
+		struct io_wq_work_node		node;
+		struct llist_node		llist_node;
+	};
 	io_req_tw_func_t		func;
 };
 
@@ -615,8 +620,6 @@ struct io_kiocb {
 	 */
 	u16				buf_index;
 
-	unsigned			nr_tw;
-
 	/* REQ_F_* flags */
 	io_req_flags_t			flags;
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 87d7d8bbf814..9c06911077db 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -249,7 +249,7 @@ static __cold void io_fallback_req_func(struct work_struct *work)
 
 	percpu_ref_get(&ctx->refs);
 	mutex_lock(&ctx->uring_lock);
-	llist_for_each_entry_safe(req, tmp, node, io_task_work.node)
+	llist_for_each_entry_safe(req, tmp, node, io_task_work.llist_node)
 		req->io_task_work.func(req, &ts);
 	io_submit_flush_completions(ctx);
 	mutex_unlock(&ctx->uring_lock);
@@ -330,7 +330,8 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_LIST_HEAD(&ctx->timeout_list);
 	INIT_LIST_HEAD(&ctx->ltimeout_list);
 	INIT_LIST_HEAD(&ctx->rsrc_ref_list);
-	init_llist_head(&ctx->work_llist);
+	INIT_WQ_LIST(&ctx->work_list);
+	spin_lock_init(&ctx->work_lock);
 	INIT_LIST_HEAD(&ctx->tctx_list);
 	ctx->submit_state.free_list.next = NULL;
 	INIT_WQ_LIST(&ctx->locked_free_list);
@@ -1135,7 +1136,7 @@ struct llist_node *io_handle_tw_list(struct llist_node *node,
 	do {
 		struct llist_node *next = node->next;
 		struct io_kiocb *req = container_of(node, struct io_kiocb,
-						    io_task_work.node);
+						    io_task_work.llist_node);
 
 		if (req->ctx != ctx) {
 			ctx_flush_and_put(ctx, &ts);
@@ -1159,20 +1160,6 @@ struct llist_node *io_handle_tw_list(struct llist_node *node,
 	return node;
 }
 
-/**
- * io_llist_xchg - swap all entries in a lock-less list
- * @head:	the head of lock-less list to delete all entries
- * @new:	new entry as the head of the list
- *
- * If list is empty, return NULL, otherwise, return the pointer to the first entry.
- * The order of entries returned is from the newest to the oldest added one.
- */
-static inline struct llist_node *io_llist_xchg(struct llist_head *head,
-					       struct llist_node *new)
-{
-	return xchg(&head->first, new);
-}
-
 static __cold void io_fallback_tw(struct io_uring_task *tctx, bool sync)
 {
 	struct llist_node *node = llist_del_all(&tctx->task_list);
@@ -1180,7 +1167,7 @@ static __cold void io_fallback_tw(struct io_uring_task *tctx, bool sync)
 	struct io_kiocb *req;
 
 	while (node) {
-		req = container_of(node, struct io_kiocb, io_task_work.node);
+		req = container_of(node, struct io_kiocb, io_task_work.llist_node);
 		node = node->next;
 		if (sync && last_ctx != req->ctx) {
 			if (last_ctx) {
@@ -1190,7 +1177,7 @@ static __cold void io_fallback_tw(struct io_uring_task *tctx, bool sync)
 			last_ctx = req->ctx;
 			percpu_ref_get(&last_ctx->refs);
 		}
-		if (llist_add(&req->io_task_work.node,
+		if (llist_add(&req->io_task_work.llist_node,
 			      &req->ctx->fallback_llist))
 			schedule_delayed_work(&req->ctx->fallback_work, 1);
 	}
@@ -1238,48 +1225,26 @@ void tctx_task_work(struct callback_head *cb)
 	WARN_ON_ONCE(ret);
 }
 
-static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
+static inline void io_req_local_work_add(struct io_kiocb *req, unsigned tw_flags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
-	unsigned nr_wait, nr_tw, nr_tw_prev;
-	struct llist_node *head;
+	unsigned nr_wait, nr_tw;
+	unsigned long flags;
 
 	/* See comment above IO_CQ_WAKE_INIT */
 	BUILD_BUG_ON(IO_CQ_WAKE_FORCE <= IORING_MAX_CQ_ENTRIES);
 
 	/*
-	 * We don't know how many reuqests is there in the link and whether
+	 * We don't know how many requests is there in the link and whether
 	 * they can even be queued lazily, fall back to non-lazy.
 	 */
 	if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK))
-		flags &= ~IOU_F_TWQ_LAZY_WAKE;
-
-	head = READ_ONCE(ctx->work_llist.first);
-	do {
-		nr_tw_prev = 0;
-		if (head) {
-			struct io_kiocb *first_req = container_of(head,
-							struct io_kiocb,
-							io_task_work.node);
-			/*
-			 * Might be executed at any moment, rely on
-			 * SLAB_TYPESAFE_BY_RCU to keep it alive.
-			 */
-			nr_tw_prev = READ_ONCE(first_req->nr_tw);
-		}
-
-		/*
-		 * Theoretically, it can overflow, but that's fine as one of
-		 * previous adds should've tried to wake the task.
-		 */
-		nr_tw = nr_tw_prev + 1;
-		if (!(flags & IOU_F_TWQ_LAZY_WAKE))
-			nr_tw = IO_CQ_WAKE_FORCE;
+		tw_flags &= ~IOU_F_TWQ_LAZY_WAKE;
 
-		req->nr_tw = nr_tw;
-		req->io_task_work.node.next = head;
-	} while (!try_cmpxchg(&ctx->work_llist.first, &head,
-			      &req->io_task_work.node));
+	spin_lock_irqsave(&ctx->work_lock, flags);
+	wq_list_add_tail(&req->io_task_work.node, &ctx->work_list);
+	nr_tw = ++ctx->work_items;
+	spin_unlock_irqrestore(&ctx->work_lock, flags);
 
 	/*
 	 * cmpxchg implies a full barrier, which pairs with the barrier
@@ -1289,7 +1254,7 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 	 * is similar to the wait/wawke task state sync.
 	 */
 
-	if (!head) {
+	if (nr_tw == 1) {
 		if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
 			atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
 		if (ctx->has_evfd)
@@ -1297,13 +1262,8 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 	}
 
 	nr_wait = atomic_read(&ctx->cq_wait_nr);
-	/* not enough or no one is waiting */
-	if (nr_tw < nr_wait)
-		return;
-	/* the previous add has already woken it up */
-	if (nr_tw_prev >= nr_wait)
-		return;
-	wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
+	if (nr_tw >= nr_wait)
+		wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
 }
 
 static void io_req_normal_work_add(struct io_kiocb *req)
@@ -1312,7 +1272,7 @@ static void io_req_normal_work_add(struct io_kiocb *req)
 	struct io_ring_ctx *ctx = req->ctx;
 
 	/* task_work already pending, we're done */
-	if (!llist_add(&req->io_task_work.node, &tctx->task_list))
+	if (!llist_add(&req->io_task_work.llist_node, &tctx->task_list))
 		return;
 
 	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
@@ -1346,9 +1306,15 @@ void __io_req_task_work_add(struct io_kiocb *req, unsigned flags)
 
 static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
 {
-	struct llist_node *node;
+	struct io_wq_work_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ctx->work_lock, flags);
+	node = ctx->work_list.first;
+	INIT_WQ_LIST(&ctx->work_list);
+	ctx->work_items = 0;
+	spin_unlock_irqrestore(&ctx->work_lock, flags);
 
-	node = llist_del_all(&ctx->work_llist);
 	while (node) {
 		struct io_kiocb *req = container_of(node, struct io_kiocb,
 						    io_task_work.node);
@@ -1361,7 +1327,7 @@ static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
 static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
 				       int min_events)
 {
-	if (llist_empty(&ctx->work_llist))
+	if (wq_list_empty(&ctx->work_list))
 		return false;
 	if (events < min_events)
 		return true;
@@ -1373,7 +1339,7 @@ static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
 static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts,
 			       int min_events)
 {
-	struct llist_node *node;
+	struct io_wq_work_node *node;
 	unsigned int loops = 0;
 	int ret = 0;
 
@@ -1382,13 +1348,14 @@ static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts,
 	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
 		atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
 again:
-	/*
-	 * llists are in reverse order, flip it back the right way before
-	 * running the pending items.
-	 */
-	node = llist_reverse_order(io_llist_xchg(&ctx->work_llist, NULL));
+	spin_lock_irq(&ctx->work_lock);
+	node = ctx->work_list.first;
+	INIT_WQ_LIST(&ctx->work_list);
+	ctx->work_items = 0;
+	spin_unlock_irq(&ctx->work_lock);
+
 	while (node) {
-		struct llist_node *next = node->next;
+		struct io_wq_work_node *next = node->next;
 		struct io_kiocb *req = container_of(node, struct io_kiocb,
 						    io_task_work.node);
 		INDIRECT_CALL_2(req->io_task_work.func,
@@ -1414,7 +1381,7 @@ static inline int io_run_local_work_locked(struct io_ring_ctx *ctx,
 {
 	struct io_tw_state ts = {};
 
-	if (llist_empty(&ctx->work_llist))
+	if (wq_list_empty(&ctx->work_list))
 		return 0;
 	return __io_run_local_work(ctx, &ts, min_events);
 }
@@ -2426,7 +2393,7 @@ static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
 
 int io_run_task_work_sig(struct io_ring_ctx *ctx)
 {
-	if (!llist_empty(&ctx->work_llist)) {
+	if (!wq_list_empty(&ctx->work_list)) {
 		__set_current_state(TASK_RUNNING);
 		if (io_run_local_work(ctx, INT_MAX) > 0)
 			return 0;
@@ -2455,7 +2422,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 
 	if (unlikely(READ_ONCE(ctx->check_cq)))
 		return 1;
-	if (unlikely(!llist_empty(&ctx->work_llist)))
+	if (unlikely(!wq_list_empty(&ctx->work_list)))
 		return 1;
 	if (unlikely(test_thread_flag(TIF_NOTIFY_SIGNAL)))
 		return 1;
@@ -2494,7 +2461,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 
 	if (!io_allowed_run_tw(ctx))
 		return -EEXIST;
-	if (!llist_empty(&ctx->work_llist))
+	if (!wq_list_empty(&ctx->work_list))
 		io_run_local_work(ctx, min_events);
 	io_run_task_work();
 	io_cqring_overflow_flush(ctx);
@@ -2558,7 +2525,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 		 * now rather than let the caller do another wait loop.
 		 */
 		io_run_task_work();
-		if (!llist_empty(&ctx->work_llist))
+		if (!wq_list_empty(&ctx->work_list))
 			io_run_local_work(ctx, nr_wait);
 
 		/*
@@ -3331,7 +3298,7 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
 		io_run_task_work();
 		io_uring_drop_tctx_refs(current);
 		xa_for_each(&tctx->xa, index, node) {
-			if (!llist_empty(&node->ctx->work_llist)) {
+			if (!wq_list_empty(&node->ctx->work_list)) {
 				WARN_ON_ONCE(node->ctx->submitter_task &&
 					     node->ctx->submitter_task != current);
 				goto end_wait;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 27d039ddb05e..bb30a29d0e27 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -343,7 +343,7 @@ static inline int io_run_task_work(void)
 
 static inline bool io_task_work_pending(struct io_ring_ctx *ctx)
 {
-	return task_work_pending(current) || !llist_empty(&ctx->work_llist);
+	return task_work_pending(current) || !wq_list_empty(&ctx->work_list);
 }
 
 static inline void io_tw_lock(struct io_ring_ctx *ctx, struct io_tw_state *ts)
@@ -457,6 +457,6 @@ enum {
 static inline bool io_has_work(struct io_ring_ctx *ctx)
 {
 	return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq) ||
-	       !llist_empty(&ctx->work_llist);
+	       !wq_list_empty(&ctx->work_list);
 }
 #endif
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/4] io_uring: switch fallback work to io_wq_work_list
  2024-03-26 18:42 [PATCHSET 0/4] Use io_wq_work_list for task_work Jens Axboe
  2024-03-26 18:42 ` [PATCH 1/4] io_uring: use the right type for work_llist empty check Jens Axboe
  2024-03-26 18:42 ` [PATCH 2/4] io_uring: switch deferred task_work to an io_wq_work_list Jens Axboe
@ 2024-03-26 18:42 ` Jens Axboe
  2024-03-26 18:42 ` [PATCH 4/4] io_uring: switch normal task_work " Jens Axboe
  2024-03-27 13:33 ` [PATCHSET 0/4] Use io_wq_work_list for task_work Pavel Begunkov
  4 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2024-03-26 18:42 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Just like what was done for deferred task_work, convert the fallback
task_work to a normal io_wq_work_list.

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/io_uring_types.h |  2 +-
 io_uring/io_uring.c            | 24 +++++++++++++++++++-----
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index e51bf15196e4..2bc253f8147d 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -400,7 +400,7 @@ struct io_ring_ctx {
 	struct mm_struct		*mm_account;
 
 	/* ctx exit and cancelation */
-	struct llist_head		fallback_llist;
+	struct io_wq_work_list		fallback_list;
 	struct delayed_work		fallback_work;
 	struct work_struct		exit_work;
 	struct list_head		tctx_list;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 9c06911077db..8d7138eaa921 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -243,14 +243,22 @@ static __cold void io_fallback_req_func(struct work_struct *work)
 {
 	struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx,
 						fallback_work.work);
-	struct llist_node *node = llist_del_all(&ctx->fallback_llist);
-	struct io_kiocb *req, *tmp;
+	struct io_wq_work_node *node;
 	struct io_tw_state ts = {};
+	struct io_kiocb *req;
+
+	spin_lock_irq(&ctx->work_lock);
+	node = ctx->fallback_list.first;
+	INIT_WQ_LIST(&ctx->fallback_list);
+	spin_unlock_irq(&ctx->work_lock);
 
 	percpu_ref_get(&ctx->refs);
 	mutex_lock(&ctx->uring_lock);
-	llist_for_each_entry_safe(req, tmp, node, io_task_work.llist_node)
+	while (node) {
+		req = container_of(node, struct io_kiocb, io_task_work.node);
+		node = node->next;
 		req->io_task_work.func(req, &ts);
+	}
 	io_submit_flush_completions(ctx);
 	mutex_unlock(&ctx->uring_lock);
 	percpu_ref_put(&ctx->refs);
@@ -1167,6 +1175,9 @@ static __cold void io_fallback_tw(struct io_uring_task *tctx, bool sync)
 	struct io_kiocb *req;
 
 	while (node) {
+		unsigned long flags;
+		bool do_wake;
+
 		req = container_of(node, struct io_kiocb, io_task_work.llist_node);
 		node = node->next;
 		if (sync && last_ctx != req->ctx) {
@@ -1177,8 +1188,11 @@ static __cold void io_fallback_tw(struct io_uring_task *tctx, bool sync)
 			last_ctx = req->ctx;
 			percpu_ref_get(&last_ctx->refs);
 		}
-		if (llist_add(&req->io_task_work.llist_node,
-			      &req->ctx->fallback_llist))
+		spin_lock_irqsave(&req->ctx->work_lock, flags);
+		do_wake = wq_list_empty(&req->ctx->fallback_list);
+		wq_list_add_tail(&req->io_task_work.node, &req->ctx->fallback_list);
+		spin_unlock_irqrestore(&req->ctx->work_lock, flags);
+		if (do_wake)
 			schedule_delayed_work(&req->ctx->fallback_work, 1);
 	}
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 4/4] io_uring: switch normal task_work to io_wq_work_list
  2024-03-26 18:42 [PATCHSET 0/4] Use io_wq_work_list for task_work Jens Axboe
                   ` (2 preceding siblings ...)
  2024-03-26 18:42 ` [PATCH 3/4] io_uring: switch fallback work to io_wq_work_list Jens Axboe
@ 2024-03-26 18:42 ` Jens Axboe
  2024-03-27 13:33 ` [PATCHSET 0/4] Use io_wq_work_list for task_work Pavel Begunkov
  4 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2024-03-26 18:42 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

This concludes the transition, now all times of task_work are using the
same mechanism.

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/io_uring_types.h |  8 ++----
 io_uring/io_uring.c            | 50 ++++++++++++++++++++++------------
 io_uring/io_uring.h            |  4 +--
 io_uring/sqpoll.c              |  8 +++---
 io_uring/tctx.c                |  3 +-
 5 files changed, 43 insertions(+), 30 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 2bc253f8147d..f46f871c09fe 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -95,7 +95,8 @@ struct io_uring_task {
 	struct percpu_counter		inflight;
 
 	struct { /* task_work */
-		struct llist_head	task_list;
+		struct io_wq_work_list	task_list;
+		spinlock_t		task_lock;
 		struct callback_head	task_work;
 	} ____cacheline_aligned_in_smp;
 };
@@ -561,10 +562,7 @@ enum {
 typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
 
 struct io_task_work {
-	union {
-		struct io_wq_work_node		node;
-		struct llist_node		llist_node;
-	};
+	struct io_wq_work_node		node;
 	io_req_tw_func_t		func;
 };
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 8d7138eaa921..e12b518e0b84 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1134,17 +1134,17 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, struct io_tw_state *ts)
  * If more entries than max_entries are available, stop processing once this
  * is reached and return the rest of the list.
  */
-struct llist_node *io_handle_tw_list(struct llist_node *node,
-				     unsigned int *count,
-				     unsigned int max_entries)
+struct io_wq_work_node *io_handle_tw_list(struct io_wq_work_node *node,
+					  unsigned int *count,
+					  unsigned int max_entries)
 {
 	struct io_ring_ctx *ctx = NULL;
 	struct io_tw_state ts = { };
 
 	do {
-		struct llist_node *next = node->next;
+		struct io_wq_work_node *next = node->next;
 		struct io_kiocb *req = container_of(node, struct io_kiocb,
-						    io_task_work.llist_node);
+						    io_task_work.node);
 
 		if (req->ctx != ctx) {
 			ctx_flush_and_put(ctx, &ts);
@@ -1170,15 +1170,20 @@ struct llist_node *io_handle_tw_list(struct llist_node *node,
 
 static __cold void io_fallback_tw(struct io_uring_task *tctx, bool sync)
 {
-	struct llist_node *node = llist_del_all(&tctx->task_list);
 	struct io_ring_ctx *last_ctx = NULL;
+	struct io_wq_work_node *node;
 	struct io_kiocb *req;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tctx->task_lock, flags);
+	node = tctx->task_list.first;
+	INIT_WQ_LIST(&tctx->task_list);
+	spin_unlock_irqrestore(&tctx->task_lock, flags);
 
 	while (node) {
-		unsigned long flags;
 		bool do_wake;
 
-		req = container_of(node, struct io_kiocb, io_task_work.llist_node);
+		req = container_of(node, struct io_kiocb, io_task_work.node);
 		node = node->next;
 		if (sync && last_ctx != req->ctx) {
 			if (last_ctx) {
@@ -1202,22 +1207,24 @@ static __cold void io_fallback_tw(struct io_uring_task *tctx, bool sync)
 	}
 }
 
-struct llist_node *tctx_task_work_run(struct io_uring_task *tctx,
-				      unsigned int max_entries,
-				      unsigned int *count)
+struct io_wq_work_node *tctx_task_work_run(struct io_uring_task *tctx,
+					   unsigned int max_entries,
+					   unsigned int *count)
 {
-	struct llist_node *node;
+	struct io_wq_work_node *node;
 
 	if (unlikely(current->flags & PF_EXITING)) {
 		io_fallback_tw(tctx, true);
 		return NULL;
 	}
 
-	node = llist_del_all(&tctx->task_list);
-	if (node) {
-		node = llist_reverse_order(node);
+	spin_lock_irq(&tctx->task_lock);
+	node = tctx->task_list.first;
+	INIT_WQ_LIST(&tctx->task_list);
+	spin_unlock_irq(&tctx->task_lock);
+
+	if (node)
 		node = io_handle_tw_list(node, count, max_entries);
-	}
 
 	/* relaxed read is enough as only the task itself sets ->in_cancel */
 	if (unlikely(atomic_read(&tctx->in_cancel)))
@@ -1229,8 +1236,8 @@ struct llist_node *tctx_task_work_run(struct io_uring_task *tctx,
 
 void tctx_task_work(struct callback_head *cb)
 {
+	struct io_wq_work_node *ret;
 	struct io_uring_task *tctx;
-	struct llist_node *ret;
 	unsigned int count = 0;
 
 	tctx = container_of(cb, struct io_uring_task, task_work);
@@ -1284,9 +1291,16 @@ static void io_req_normal_work_add(struct io_kiocb *req)
 {
 	struct io_uring_task *tctx = req->task->io_uring;
 	struct io_ring_ctx *ctx = req->ctx;
+	unsigned long flags;
+	bool was_empty;
+
+	spin_lock_irqsave(&tctx->task_lock, flags);
+	was_empty = wq_list_empty(&tctx->task_list);
+	wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
+	spin_unlock_irqrestore(&tctx->task_lock, flags);
 
 	/* task_work already pending, we're done */
-	if (!llist_add(&req->io_task_work.llist_node, &tctx->task_list))
+	if (!was_empty)
 		return;
 
 	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index bb30a29d0e27..e1582529bc58 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -87,8 +87,8 @@ void io_req_task_queue(struct io_kiocb *req);
 void io_req_task_complete(struct io_kiocb *req, struct io_tw_state *ts);
 void io_req_task_queue_fail(struct io_kiocb *req, int ret);
 void io_req_task_submit(struct io_kiocb *req, struct io_tw_state *ts);
-struct llist_node *io_handle_tw_list(struct llist_node *node, unsigned int *count, unsigned int max_entries);
-struct llist_node *tctx_task_work_run(struct io_uring_task *tctx, unsigned int max_entries, unsigned int *count);
+struct io_wq_work_node *io_handle_tw_list(struct io_wq_work_node *node, unsigned int *count, unsigned int max_entries);
+struct io_wq_work_node *tctx_task_work_run(struct io_uring_task *tctx, unsigned int max_entries, unsigned int *count);
 void tctx_task_work(struct callback_head *cb);
 __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd);
 int io_uring_alloc_task_context(struct task_struct *task,
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index 3983708cef5b..3a34b867d5c0 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -230,7 +230,7 @@ static bool io_sqd_handle_event(struct io_sq_data *sqd)
  * than we were asked to process. Newly queued task_work isn't run until the
  * retry list has been fully processed.
  */
-static unsigned int io_sq_tw(struct llist_node **retry_list, int max_entries)
+static unsigned int io_sq_tw(struct io_wq_work_node **retry_list, int max_entries)
 {
 	struct io_uring_task *tctx = current->io_uring;
 	unsigned int count = 0;
@@ -246,11 +246,11 @@ static unsigned int io_sq_tw(struct llist_node **retry_list, int max_entries)
 	return count;
 }
 
-static bool io_sq_tw_pending(struct llist_node *retry_list)
+static bool io_sq_tw_pending(struct io_wq_work_node *retry_list)
 {
 	struct io_uring_task *tctx = current->io_uring;
 
-	return retry_list || !llist_empty(&tctx->task_list);
+	return retry_list || !wq_list_empty(&tctx->task_list);
 }
 
 static void io_sq_update_worktime(struct io_sq_data *sqd, struct rusage *start)
@@ -266,7 +266,7 @@ static void io_sq_update_worktime(struct io_sq_data *sqd, struct rusage *start)
 
 static int io_sq_thread(void *data)
 {
-	struct llist_node *retry_list = NULL;
+	struct io_wq_work_node *retry_list = NULL;
 	struct io_sq_data *sqd = data;
 	struct io_ring_ctx *ctx;
 	struct rusage start;
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index c043fe93a3f2..9bc0e203b780 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -86,7 +86,8 @@ __cold int io_uring_alloc_task_context(struct task_struct *task,
 	atomic_set(&tctx->in_cancel, 0);
 	atomic_set(&tctx->inflight_tracked, 0);
 	task->io_uring = tctx;
-	init_llist_head(&tctx->task_list);
+	INIT_WQ_LIST(&tctx->task_list);
+	spin_lock_init(&tctx->task_lock);
 	init_task_work(&tctx->task_work, tctx_task_work);
 	return 0;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/4] io_uring: switch deferred task_work to an io_wq_work_list
  2024-03-26 18:42 ` [PATCH 2/4] io_uring: switch deferred task_work to an io_wq_work_list Jens Axboe
@ 2024-03-27 13:24   ` Pavel Begunkov
  2024-03-27 15:45     ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2024-03-27 13:24 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 3/26/24 18:42, Jens Axboe wrote:
> Lockless lists may be handy for some things, but they mean that items
> are in the reverse order as we can only add to the head of the list.
> That in turn means that iterating items on the list needs to reverse it
> first, if it's sensitive to ordering between items on the list.
> 
> Switch the DEFER_TASKRUN work list from an llist to a normal
> io_wq_work_list, and protect it with a lock. Then we can get rid of the
> manual reversing of the list when running it, which takes considerable
> cycles particularly for bursty task_work additions.
> 
> Signed-off-by: Jens Axboe <[email protected]>
> ---
>   include/linux/io_uring_types.h |  11 ++--
>   io_uring/io_uring.c            | 117 ++++++++++++---------------------
>   io_uring/io_uring.h            |   4 +-
>   3 files changed, 51 insertions(+), 81 deletions(-)
> 
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index aeb4639785b5..e51bf15196e4 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -329,7 +329,9 @@ struct io_ring_ctx {
>   	 * regularly bounce b/w CPUs.

...

> -static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
> +static inline void io_req_local_work_add(struct io_kiocb *req, unsigned tw_flags)
>   {
>   	struct io_ring_ctx *ctx = req->ctx;
> -	unsigned nr_wait, nr_tw, nr_tw_prev;
> -	struct llist_node *head;
> +	unsigned nr_wait, nr_tw;
> +	unsigned long flags;
>   
>   	/* See comment above IO_CQ_WAKE_INIT */
>   	BUILD_BUG_ON(IO_CQ_WAKE_FORCE <= IORING_MAX_CQ_ENTRIES);
>   
>   	/*
> -	 * We don't know how many reuqests is there in the link and whether
> +	 * We don't know how many requests is there in the link and whether
>   	 * they can even be queued lazily, fall back to non-lazy.
>   	 */
>   	if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK))
> -		flags &= ~IOU_F_TWQ_LAZY_WAKE;
> -
> -	head = READ_ONCE(ctx->work_llist.first);
> -	do {
> -		nr_tw_prev = 0;
> -		if (head) {
> -			struct io_kiocb *first_req = container_of(head,
> -							struct io_kiocb,
> -							io_task_work.node);
> -			/*
> -			 * Might be executed at any moment, rely on
> -			 * SLAB_TYPESAFE_BY_RCU to keep it alive.
> -			 */
> -			nr_tw_prev = READ_ONCE(first_req->nr_tw);
> -		}
> -
> -		/*
> -		 * Theoretically, it can overflow, but that's fine as one of
> -		 * previous adds should've tried to wake the task.
> -		 */
> -		nr_tw = nr_tw_prev + 1;
> -		if (!(flags & IOU_F_TWQ_LAZY_WAKE))
> -			nr_tw = IO_CQ_WAKE_FORCE;

Aren't you just killing the entire IOU_F_TWQ_LAZY_WAKE handling?
It's assigned to IO_CQ_WAKE_FORCE so that it passes the check
before wake_up below.

> +		tw_flags &= ~IOU_F_TWQ_LAZY_WAKE;
>   
> -		req->nr_tw = nr_tw;
> -		req->io_task_work.node.next = head;
> -	} while (!try_cmpxchg(&ctx->work_llist.first, &head,
> -			      &req->io_task_work.node));
> +	spin_lock_irqsave(&ctx->work_lock, flags);
> +	wq_list_add_tail(&req->io_task_work.node, &ctx->work_list);
> +	nr_tw = ++ctx->work_items;
> +	spin_unlock_irqrestore(&ctx->work_lock, flags);

smp_mb(), see the comment below, and fwiw "_after_atomic" would not
work.

>   	/*
>   	 * cmpxchg implies a full barrier, which pairs with the barrier
> @@ -1289,7 +1254,7 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>   	 * is similar to the wait/wawke task state sync.
>   	 */
>   
> -	if (!head) {
> +	if (nr_tw == 1) {
>   		if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
>   			atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
>   		if (ctx->has_evfd)
> @@ -1297,13 +1262,8 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>   	}
>   
>   	nr_wait = atomic_read(&ctx->cq_wait_nr);
> -	/* not enough or no one is waiting */
> -	if (nr_tw < nr_wait)
> -		return;
> -	/* the previous add has already woken it up */
> -	if (nr_tw_prev >= nr_wait)
> -		return;
> -	wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
> +	if (nr_tw >= nr_wait)
> +		wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);

IIUC, you're removing a very important optimisation, and I
don't understand why'd you do that. It was waking up only when
it's changing from "don't need to wake" to "have to be woken up",
just once per splicing the list on the waiting side.

It seems like this one will keep pounding with wake_up_state for
every single tw queued after @nr_wait, which quite often just 1.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHSET 0/4] Use io_wq_work_list for task_work
  2024-03-26 18:42 [PATCHSET 0/4] Use io_wq_work_list for task_work Jens Axboe
                   ` (3 preceding siblings ...)
  2024-03-26 18:42 ` [PATCH 4/4] io_uring: switch normal task_work " Jens Axboe
@ 2024-03-27 13:33 ` Pavel Begunkov
  2024-03-27 16:36   ` Jens Axboe
  4 siblings, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2024-03-27 13:33 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 3/26/24 18:42, Jens Axboe wrote:
> Hi,
> 
> This converts the deferred, normal, and fallback task_work to use a
> normal io_wq_work_list, rather than an llist.
> 
> The main motivation behind this is to get rid of the need to reverse
> the list once it's deleted and run. I tested this basic conversion of
> just switching it from an llist to an io_wq_work_list with a spinlock,
> and I don't see any benefits from the lockless list. And for cases where
> we get a bursty addition of task_work, this approach is faster as it
> avoids the need to iterate the list upfront while reversing it.

I'm curious how you benchmarked it including accounting of irq/softirq
where tw add usually happens?

One known problem with the current list approach I mentioned several
times before is that it peeks at the previous queued tw to count them.
It's not nice, but that can be easily done with cmpxchg double. I
wonder how much of an issue is that.

> And this is less code and simpler, so I'd prefer to go that route.

I'm not sure it's less code, if you return optimisations that I
believe were killed, see comments to patch 2, it might turn out to
be even bulkier and not that simpler.


>   include/linux/io_uring_types.h |  13 +--
>   io_uring/io_uring.c            | 175 ++++++++++++++++-----------------
>   io_uring/io_uring.h            |   8 +-
>   io_uring/sqpoll.c              |   8 +-
>   io_uring/tctx.c                |   3 +-
>   5 files changed, 102 insertions(+), 105 deletions(-)
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/4] io_uring: switch deferred task_work to an io_wq_work_list
  2024-03-27 13:24   ` Pavel Begunkov
@ 2024-03-27 15:45     ` Jens Axboe
  2024-03-27 16:37       ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2024-03-27 15:45 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 3/27/24 7:24 AM, Pavel Begunkov wrote:
> On 3/26/24 18:42, Jens Axboe wrote:
>> Lockless lists may be handy for some things, but they mean that items
>> are in the reverse order as we can only add to the head of the list.
>> That in turn means that iterating items on the list needs to reverse it
>> first, if it's sensitive to ordering between items on the list.
>>
>> Switch the DEFER_TASKRUN work list from an llist to a normal
>> io_wq_work_list, and protect it with a lock. Then we can get rid of the
>> manual reversing of the list when running it, which takes considerable
>> cycles particularly for bursty task_work additions.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>> ---
>>   include/linux/io_uring_types.h |  11 ++--
>>   io_uring/io_uring.c            | 117 ++++++++++++---------------------
>>   io_uring/io_uring.h            |   4 +-
>>   3 files changed, 51 insertions(+), 81 deletions(-)
>>
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index aeb4639785b5..e51bf15196e4 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -329,7 +329,9 @@ struct io_ring_ctx {
>>        * regularly bounce b/w CPUs.
> 
> ...
> 
>> -static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>> +static inline void io_req_local_work_add(struct io_kiocb *req, unsigned tw_flags)
>>   {
>>       struct io_ring_ctx *ctx = req->ctx;
>> -    unsigned nr_wait, nr_tw, nr_tw_prev;
>> -    struct llist_node *head;
>> +    unsigned nr_wait, nr_tw;
>> +    unsigned long flags;
>>         /* See comment above IO_CQ_WAKE_INIT */
>>       BUILD_BUG_ON(IO_CQ_WAKE_FORCE <= IORING_MAX_CQ_ENTRIES);
>>         /*
>> -     * We don't know how many reuqests is there in the link and whether
>> +     * We don't know how many requests is there in the link and whether
>>        * they can even be queued lazily, fall back to non-lazy.
>>        */
>>       if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK))
>> -        flags &= ~IOU_F_TWQ_LAZY_WAKE;
>> -
>> -    head = READ_ONCE(ctx->work_llist.first);
>> -    do {
>> -        nr_tw_prev = 0;
>> -        if (head) {
>> -            struct io_kiocb *first_req = container_of(head,
>> -                            struct io_kiocb,
>> -                            io_task_work.node);
>> -            /*
>> -             * Might be executed at any moment, rely on
>> -             * SLAB_TYPESAFE_BY_RCU to keep it alive.
>> -             */
>> -            nr_tw_prev = READ_ONCE(first_req->nr_tw);
>> -        }
>> -
>> -        /*
>> -         * Theoretically, it can overflow, but that's fine as one of
>> -         * previous adds should've tried to wake the task.
>> -         */
>> -        nr_tw = nr_tw_prev + 1;
>> -        if (!(flags & IOU_F_TWQ_LAZY_WAKE))
>> -            nr_tw = IO_CQ_WAKE_FORCE;
> 
> Aren't you just killing the entire IOU_F_TWQ_LAZY_WAKE handling?
> It's assigned to IO_CQ_WAKE_FORCE so that it passes the check
> before wake_up below.

Yeah I messed that one up, did fix that one yesterday before sending it
out.

>> +        tw_flags &= ~IOU_F_TWQ_LAZY_WAKE;
>>   -        req->nr_tw = nr_tw;
>> -        req->io_task_work.node.next = head;
>> -    } while (!try_cmpxchg(&ctx->work_llist.first, &head,
>> -                  &req->io_task_work.node));
>> +    spin_lock_irqsave(&ctx->work_lock, flags);
>> +    wq_list_add_tail(&req->io_task_work.node, &ctx->work_list);
>> +    nr_tw = ++ctx->work_items;
>> +    spin_unlock_irqrestore(&ctx->work_lock, flags);
> 
> smp_mb(), see the comment below, and fwiw "_after_atomic" would not
> work.

For this one, I think all we need to do is have the wq_list_empty()
check be fully stable. If we read:

nr_wait = atomic_read(&ctx->cq_wait_nr);

right before a waiter does:

atomic_set(&ctx->cq_wait_nr, foo);
set_current_state(TASK_INTERRUPTIBLE);

then we need to ensure that the "I have work" check in
io_cqring_wait_schedule() sees the work. The spin_unlock() has release
semantics, and the current READ_ONCE() for work check sbould be enough,
no?

>>       /*
>>        * cmpxchg implies a full barrier, which pairs with the barrier
>> @@ -1289,7 +1254,7 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>>        * is similar to the wait/wawke task state sync.
>>        */
>>   -    if (!head) {
>> +    if (nr_tw == 1) {
>>           if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
>>               atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
>>           if (ctx->has_evfd)
>> @@ -1297,13 +1262,8 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>>       }
>>         nr_wait = atomic_read(&ctx->cq_wait_nr);
>> -    /* not enough or no one is waiting */
>> -    if (nr_tw < nr_wait)
>> -        return;
>> -    /* the previous add has already woken it up */
>> -    if (nr_tw_prev >= nr_wait)
>> -        return;
>> -    wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
>> +    if (nr_tw >= nr_wait)
>> +        wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
> 
> IIUC, you're removing a very important optimisation, and I
> don't understand why'd you do that. It was waking up only when
> it's changing from "don't need to wake" to "have to be woken up",
> just once per splicing the list on the waiting side.
> 
> It seems like this one will keep pounding with wake_up_state for
> every single tw queued after @nr_wait, which quite often just 1.

Agree, that was just a silly oversight. I brought that back now.
Apparently doesn't hit anything here, at least not to the extent that I
saw it in testing. But it is a good idea and we should keep that, so
only the first one over the threshold attempts the wake.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHSET 0/4] Use io_wq_work_list for task_work
  2024-03-27 13:33 ` [PATCHSET 0/4] Use io_wq_work_list for task_work Pavel Begunkov
@ 2024-03-27 16:36   ` Jens Axboe
  2024-03-27 17:05     ` Jens Axboe
  2024-03-27 18:04     ` Pavel Begunkov
  0 siblings, 2 replies; 14+ messages in thread
From: Jens Axboe @ 2024-03-27 16:36 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 3/27/24 7:33 AM, Pavel Begunkov wrote:
> On 3/26/24 18:42, Jens Axboe wrote:
>> Hi,
>>
>> This converts the deferred, normal, and fallback task_work to use a
>> normal io_wq_work_list, rather than an llist.
>>
>> The main motivation behind this is to get rid of the need to reverse
>> the list once it's deleted and run. I tested this basic conversion of
>> just switching it from an llist to an io_wq_work_list with a spinlock,
>> and I don't see any benefits from the lockless list. And for cases where
>> we get a bursty addition of task_work, this approach is faster as it
>> avoids the need to iterate the list upfront while reversing it.
> 
> I'm curious how you benchmarked it including accounting of irq/softirq
> where tw add usually happens?

Performance based and profiles. I tested send zc with small packets, as
that is task_work intensive and exhibits the bursty behavior I mentioned
in the patch / cover letter. And normal storage IO, IRQ driven.

For send zc, we're spending about 2% of the time doing list reversal,
and I've seen as high as 5 in other testing. And as that test is CPU
bound, performance is up about 2% as well.

With the patches, task work adding accounts for about 0.25% of the
cycles, before it's about 0.66%.

We're spending a bit more time in __io_run_local_work(), but I think
that's deceptive as we have to disable/enable interrupts now. If an
interrupt triggers on the unlock, that time tends to be attributed there
in terms of cycles.

> One known problem with the current list approach I mentioned several
> times before is that it peeks at the previous queued tw to count them.
> It's not nice, but that can be easily done with cmpxchg double. I
> wonder how much of an issue is that.

That's more of a wart than a real issue though, but we this approach
obviously doesn't do that. And then we can drop the rcu section around
adding local task_work. Not a huge deal, but still nice.

>> And this is less code and simpler, so I'd prefer to go that route.
> 
> I'm not sure it's less code, if you return optimisations that I
> believe were killed, see comments to patch 2, it might turn out to
> be even bulkier and not that simpler.

It's still considerably less:

 3 files changed, 59 insertions(+), 84 deletions(-)

thought that isn't conclusive by itself, as eg io_llist_xchg() goes away
which has some comments. But I do think the resulting code is simpler
and more straight forward.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/4] io_uring: switch deferred task_work to an io_wq_work_list
  2024-03-27 15:45     ` Jens Axboe
@ 2024-03-27 16:37       ` Jens Axboe
  2024-03-27 17:28         ` Pavel Begunkov
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2024-03-27 16:37 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 3/27/24 9:45 AM, Jens Axboe wrote:
>> smp_mb(), see the comment below, and fwiw "_after_atomic" would not
>> work.
> 
> For this one, I think all we need to do is have the wq_list_empty()
> check be fully stable. If we read:
> 
> nr_wait = atomic_read(&ctx->cq_wait_nr);
> 
> right before a waiter does:
> 
> atomic_set(&ctx->cq_wait_nr, foo);
> set_current_state(TASK_INTERRUPTIBLE);
> 
> then we need to ensure that the "I have work" check in
> io_cqring_wait_schedule() sees the work. The spin_unlock() has release
> semantics, and the current READ_ONCE() for work check sbould be enough,
> no?

To answer my own question - no, it's not enough. Let me think about this
a bit.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHSET 0/4] Use io_wq_work_list for task_work
  2024-03-27 16:36   ` Jens Axboe
@ 2024-03-27 17:05     ` Jens Axboe
  2024-03-27 18:04     ` Pavel Begunkov
  1 sibling, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2024-03-27 17:05 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 3/27/24 10:36 AM, Jens Axboe wrote:
> On 3/27/24 7:33 AM, Pavel Begunkov wrote:
>> On 3/26/24 18:42, Jens Axboe wrote:
>>> Hi,
>>>
>>> This converts the deferred, normal, and fallback task_work to use a
>>> normal io_wq_work_list, rather than an llist.
>>>
>>> The main motivation behind this is to get rid of the need to reverse
>>> the list once it's deleted and run. I tested this basic conversion of
>>> just switching it from an llist to an io_wq_work_list with a spinlock,
>>> and I don't see any benefits from the lockless list. And for cases where
>>> we get a bursty addition of task_work, this approach is faster as it
>>> avoids the need to iterate the list upfront while reversing it.
>>
>> I'm curious how you benchmarked it including accounting of irq/softirq
>> where tw add usually happens?
> 
> Performance based and profiles. I tested send zc with small packets, as
> that is task_work intensive and exhibits the bursty behavior I mentioned
> in the patch / cover letter. And normal storage IO, IRQ driven.
> 
> For send zc, we're spending about 2% of the time doing list reversal,
> and I've seen as high as 5 in other testing. And as that test is CPU
> bound, performance is up about 2% as well.
> 
> With the patches, task work adding accounts for about 0.25% of the
> cycles, before it's about 0.66%.
> 
> We're spending a bit more time in __io_run_local_work(), but I think
> that's deceptive as we have to disable/enable interrupts now. If an
> interrupt triggers on the unlock, that time tends to be attributed there
> in terms of cycles.

Forgot to mention the storage side - profiles look eerily similar in
terms of time spent in task work adding / running, the only real
difference is that 1.9% of llist_reverse_list() is gone.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/4] io_uring: switch deferred task_work to an io_wq_work_list
  2024-03-27 16:37       ` Jens Axboe
@ 2024-03-27 17:28         ` Pavel Begunkov
  2024-03-27 17:34           ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2024-03-27 17:28 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 3/27/24 16:37, Jens Axboe wrote:
> On 3/27/24 9:45 AM, Jens Axboe wrote:
>>> smp_mb(), see the comment below, and fwiw "_after_atomic" would not
>>> work.
>>
>> For this one, I think all we need to do is have the wq_list_empty()
>> check be fully stable. If we read:
>>
>> nr_wait = atomic_read(&ctx->cq_wait_nr);
>>
>> right before a waiter does:
>>
>> atomic_set(&ctx->cq_wait_nr, foo);
>> set_current_state(TASK_INTERRUPTIBLE);
>>
>> then we need to ensure that the "I have work" check in
>> io_cqring_wait_schedule() sees the work. The spin_unlock() has release
>> semantics, and the current READ_ONCE() for work check sbould be enough,
>> no?
> 
> To answer my own question - no, it's not enough. Let me think about this
> a bit.

Right, to my knowledge release does nothing for write; read;
ordering, and all ops after can leak before the barrier.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/4] io_uring: switch deferred task_work to an io_wq_work_list
  2024-03-27 17:28         ` Pavel Begunkov
@ 2024-03-27 17:34           ` Jens Axboe
  0 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2024-03-27 17:34 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 3/27/24 11:28 AM, Pavel Begunkov wrote:
> On 3/27/24 16:37, Jens Axboe wrote:
>> On 3/27/24 9:45 AM, Jens Axboe wrote:
>>>> smp_mb(), see the comment below, and fwiw "_after_atomic" would not
>>>> work.
>>>
>>> For this one, I think all we need to do is have the wq_list_empty()
>>> check be fully stable. If we read:
>>>
>>> nr_wait = atomic_read(&ctx->cq_wait_nr);
>>>
>>> right before a waiter does:
>>>
>>> atomic_set(&ctx->cq_wait_nr, foo);
>>> set_current_state(TASK_INTERRUPTIBLE);
>>>
>>> then we need to ensure that the "I have work" check in
>>> io_cqring_wait_schedule() sees the work. The spin_unlock() has release
>>> semantics, and the current READ_ONCE() for work check sbould be enough,
>>> no?
>>
>> To answer my own question - no, it's not enough. Let me think about this
>> a bit.
> 
> Right, to my knowledge release does nothing for write; read;
> ordering, and all ops after can leak before the barrier.

Yeah, it needs an smp_mb() before that atomic_read() on the task work
add side.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHSET 0/4] Use io_wq_work_list for task_work
  2024-03-27 16:36   ` Jens Axboe
  2024-03-27 17:05     ` Jens Axboe
@ 2024-03-27 18:04     ` Pavel Begunkov
  1 sibling, 0 replies; 14+ messages in thread
From: Pavel Begunkov @ 2024-03-27 18:04 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 3/27/24 16:36, Jens Axboe wrote:
> On 3/27/24 7:33 AM, Pavel Begunkov wrote:
>> On 3/26/24 18:42, Jens Axboe wrote:
>>> Hi,
>>>
>>> This converts the deferred, normal, and fallback task_work to use a
>>> normal io_wq_work_list, rather than an llist.
>>>
>>> The main motivation behind this is to get rid of the need to reverse
>>> the list once it's deleted and run. I tested this basic conversion of
>>> just switching it from an llist to an io_wq_work_list with a spinlock,
>>> and I don't see any benefits from the lockless list. And for cases where
>>> we get a bursty addition of task_work, this approach is faster as it
>>> avoids the need to iterate the list upfront while reversing it.
>>
>> I'm curious how you benchmarked it including accounting of irq/softirq
>> where tw add usually happens?
> 
> Performance based and profiles. I tested send zc with small packets, as
> that is task_work intensive and exhibits the bursty behavior I mentioned
> in the patch / cover letter. And normal storage IO, IRQ driven.

I assume IRQs are firing on random CPUs then unless you configured
it. In which case it should be bouncing of the cacheline + that
peeking at the prev also needs to fetch it from RAM / further caches.
Unless there is enough of time for TCP to batch them.

> For send zc, we're spending about 2% of the time doing list reversal,
> and I've seen as high as 5 in other testing. And as that test is CPU

I've seen similar before, but for me it was overhead shifted from
__io_run_local_work() fetching requests into reverse touching all
of them. There should be a change in __io_run_local_work()
total cycles (incl children) then I assume

> bound, performance is up about 2% as well.

Did you count by any chance how many items there was in the
list? Average or so

> With the patches, task work adding accounts for about 0.25% of the
> cycles, before it's about 0.66%.

i.e. spinlock is faster. How come? Same cmpxchg in spinlock
with often cache misses, but with irq on/off on top. The only
diff I can remember is that peek into prev req.

> We're spending a bit more time in __io_run_local_work(), but I think
> that's deceptive as we have to disable/enable interrupts now. If an
> interrupt triggers on the unlock, that time tends to be attributed there
> in terms of cycles.

Hmm, I think if run_local_work runtime doesn't change you'd
statistically get same number of interrupts "items" hitting
it, but the would be condensed more to irq-off. Or are you
accounting for some irq delivery / hw differences?

>> One known problem with the current list approach I mentioned several
>> times before is that it peeks at the previous queued tw to count them.
>> It's not nice, but that can be easily done with cmpxchg double. I
>> wonder how much of an issue is that.
> 
> That's more of a wart than a real issue though, but we this approach

Assuming tw add executing on random CPUs, that would be additional
fetch every tw add, I wouldn't disregard it right away.

> obviously doesn't do that. And then we can drop the rcu section around
> adding local task_work. Not a huge deal, but still nice.

I don't see how. After you queue the req it might be immediately
executed dropping the ctx ref, which was previously protecting ctx.
The rcu section protects ctx.

>>> And this is less code and simpler, so I'd prefer to go that route.
>>
>> I'm not sure it's less code, if you return optimisations that I
>> believe were killed, see comments to patch 2, it might turn out to
>> be even bulkier and not that simpler.
> 
> It's still considerably less:
> 
>   3 files changed, 59 insertions(+), 84 deletions(-)
> 
> thought that isn't conclusive by itself, as eg io_llist_xchg() goes away
> which has some comments. But I do think the resulting code is simpler
> and more straight forward.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-03-27 18:09 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-26 18:42 [PATCHSET 0/4] Use io_wq_work_list for task_work Jens Axboe
2024-03-26 18:42 ` [PATCH 1/4] io_uring: use the right type for work_llist empty check Jens Axboe
2024-03-26 18:42 ` [PATCH 2/4] io_uring: switch deferred task_work to an io_wq_work_list Jens Axboe
2024-03-27 13:24   ` Pavel Begunkov
2024-03-27 15:45     ` Jens Axboe
2024-03-27 16:37       ` Jens Axboe
2024-03-27 17:28         ` Pavel Begunkov
2024-03-27 17:34           ` Jens Axboe
2024-03-26 18:42 ` [PATCH 3/4] io_uring: switch fallback work to io_wq_work_list Jens Axboe
2024-03-26 18:42 ` [PATCH 4/4] io_uring: switch normal task_work " Jens Axboe
2024-03-27 13:33 ` [PATCHSET 0/4] Use io_wq_work_list for task_work Pavel Begunkov
2024-03-27 16:36   ` Jens Axboe
2024-03-27 17:05     ` Jens Axboe
2024-03-27 18:04     ` Pavel Begunkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox