public inbox for [email protected]
* [PATCH next v1 0/2] limit local tw done
@ 2024-11-20 22:14 David Wei
  2024-11-20 22:14 ` [PATCH next v1 1/2] io_uring: add io_local_work_pending() David Wei
                   ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: David Wei @ 2024-11-20 22:14 UTC (permalink / raw)
  To: io-uring; +Cc: David Wei, Jens Axboe, Pavel Begunkov

Currently when running local tw it will eagerly try and complete all
available work. When there is a burst of work coming in, the list of
work in work_llist may be much larger than the requested batch count
wait_nr. Doing all of the work eagerly may cause latency issues for some
applications that do not want to spend so much time in the kernel.

Limit the amount of local tw done to the max of 20 (a heuristic) or
wait_nr. This then does not require any userspace changes.

Many applications will submit and wait with wait_nr = 1 to mean "wait
for _at least_ 1 completion, but do more work if available". This used
to mean "all work" but now that is capped to 20 requests. If users want
more work batching, then they can set wait_nr to > 20.
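
As a rough illustration only, a userspace sketch (using liburing against a
ring created with IORING_SETUP_DEFER_TASKRUN; the batch size of 64 is an
arbitrary example) of asking for bigger batches by raising wait_nr:

#include <liburing.h>

/* Sketch: wait for at least 64 completions per call. With this series the
 * local tw run per loop is capped at max(64, 20) = 64 instead of "all of it". */
static int reap_batch(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	unsigned head, seen = 0;
	int ret;

	ret = io_uring_submit_and_wait(ring, 64);
	if (ret < 0)
		return ret;

	io_uring_for_each_cqe(ring, head, cqe)
		seen++;
	io_uring_cq_advance(ring, seen);
	return seen;
}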

David Wei (2):
  io_uring: add io_local_work_pending()
  io_uring: limit local tw done

 include/linux/io_uring_types.h |  1 +
 io_uring/io_uring.c            | 57 +++++++++++++++++++++++-----------
 io_uring/io_uring.h            |  9 ++++--
 3 files changed, 47 insertions(+), 20 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH next v1 1/2] io_uring: add io_local_work_pending()
  2024-11-20 22:14 [PATCH next v1 0/2] limit local tw done David Wei
@ 2024-11-20 22:14 ` David Wei
  2024-11-20 23:45   ` Pavel Begunkov
  2024-11-20 22:14 ` [PATCH next v1 2/2] io_uring: limit local tw done David Wei
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 30+ messages in thread
From: David Wei @ 2024-11-20 22:14 UTC (permalink / raw)
  To: io-uring; +Cc: David Wei, Jens Axboe, Pavel Begunkov

In preparation for adding a new llist of tw to retry due to hitting the
tw limit, add a helper io_local_work_pending(). This function returns
true if there is any local tw pending. For now it only checks
ctx->work_llist.

Signed-off-by: David Wei <[email protected]>
---
 io_uring/io_uring.c | 14 +++++++-------
 io_uring/io_uring.h |  9 +++++++--
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 801293399883..83bf041d2648 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1260,7 +1260,7 @@ static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
 static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
 				       int min_events)
 {
-	if (llist_empty(&ctx->work_llist))
+	if (!io_local_work_pending(ctx))
 		return false;
 	if (events < min_events)
 		return true;
@@ -1313,7 +1313,7 @@ static inline int io_run_local_work_locked(struct io_ring_ctx *ctx,
 {
 	struct io_tw_state ts = {};
 
-	if (llist_empty(&ctx->work_llist))
+	if (!io_local_work_pending(ctx))
 		return 0;
 	return __io_run_local_work(ctx, &ts, min_events);
 }
@@ -2328,7 +2328,7 @@ static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
 
 int io_run_task_work_sig(struct io_ring_ctx *ctx)
 {
-	if (!llist_empty(&ctx->work_llist)) {
+	if (io_local_work_pending(ctx)) {
 		__set_current_state(TASK_RUNNING);
 		if (io_run_local_work(ctx, INT_MAX) > 0)
 			return 0;
@@ -2459,7 +2459,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 {
 	if (unlikely(READ_ONCE(ctx->check_cq)))
 		return 1;
-	if (unlikely(!llist_empty(&ctx->work_llist)))
+	if (unlikely(io_local_work_pending(ctx)))
 		return 1;
 	if (unlikely(task_work_pending(current)))
 		return 1;
@@ -2493,7 +2493,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
 
 	if (!io_allowed_run_tw(ctx))
 		return -EEXIST;
-	if (!llist_empty(&ctx->work_llist))
+	if (io_local_work_pending(ctx))
 		io_run_local_work(ctx, min_events);
 	io_run_task_work();
 
@@ -2564,7 +2564,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
 		 * If we got woken because of task_work being processed, run it
 		 * now rather than let the caller do another wait loop.
 		 */
-		if (!llist_empty(&ctx->work_llist))
+		if (io_local_work_pending(ctx))
 			io_run_local_work(ctx, nr_wait);
 		io_run_task_work();
 
@@ -3158,7 +3158,7 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
 		io_run_task_work();
 		io_uring_drop_tctx_refs(current);
 		xa_for_each(&tctx->xa, index, node) {
-			if (!llist_empty(&node->ctx->work_llist)) {
+			if (io_local_work_pending(node->ctx)) {
 				WARN_ON_ONCE(node->ctx->submitter_task &&
 					     node->ctx->submitter_task != current);
 				goto end_wait;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 4070d4c8ef97..69eb3b23a5a0 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -347,9 +347,14 @@ static inline int io_run_task_work(void)
 	return ret;
 }
 
+static inline bool io_local_work_pending(struct io_ring_ctx *ctx)
+{
+	return !llist_empty(&ctx->work_llist);
+}
+
 static inline bool io_task_work_pending(struct io_ring_ctx *ctx)
 {
-	return task_work_pending(current) || !llist_empty(&ctx->work_llist);
+	return task_work_pending(current) || io_local_work_pending(ctx);
 }
 
 static inline void io_tw_lock(struct io_ring_ctx *ctx, struct io_tw_state *ts)
@@ -484,6 +489,6 @@ enum {
 static inline bool io_has_work(struct io_ring_ctx *ctx)
 {
 	return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq) ||
-	       !llist_empty(&ctx->work_llist);
+	       io_local_work_pending(ctx);
 }
 #endif
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-20 22:14 [PATCH next v1 0/2] limit local tw done David Wei
  2024-11-20 22:14 ` [PATCH next v1 1/2] io_uring: add io_local_work_pending() David Wei
@ 2024-11-20 22:14 ` David Wei
  2024-11-20 23:56   ` Pavel Begunkov
  2024-11-21  1:12 ` [PATCH next v1 0/2] " Jens Axboe
  2024-11-21 14:16 ` Jens Axboe
  3 siblings, 1 reply; 30+ messages in thread
From: David Wei @ 2024-11-20 22:14 UTC (permalink / raw)
  To: io-uring; +Cc: David Wei, Jens Axboe, Pavel Begunkov

Instead of eagerly running all available local tw, limit the amount of
local tw done to the max of IO_LOCAL_TW_DEFAULT_MAX (20) or wait_nr. The
value of 20 is chosen as a reasonable heuristic to allow enough work
batching but also keep latency down.

Add a retry_llist that maintains a list of local tw that couldn't be
done in time. No synchronisation is needed since it is only modified
within the task context.

Signed-off-by: David Wei <[email protected]>
---
 include/linux/io_uring_types.h |  1 +
 io_uring/io_uring.c            | 43 +++++++++++++++++++++++++---------
 io_uring/io_uring.h            |  2 +-
 3 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 593c10a02144..011860ade268 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -336,6 +336,7 @@ struct io_ring_ctx {
 	 */
 	struct {
 		struct llist_head	work_llist;
+		struct llist_head	retry_llist;
 		unsigned long		check_cq;
 		atomic_t		cq_wait_nr;
 		atomic_t		cq_timeouts;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 83bf041d2648..c3a7d0197636 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -121,6 +121,7 @@
 
 #define IO_COMPL_BATCH			32
 #define IO_REQ_ALLOC_BATCH		8
+#define IO_LOCAL_TW_DEFAULT_MAX		20
 
 struct io_defer_entry {
 	struct list_head	list;
@@ -1255,6 +1256,8 @@ static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
 	struct llist_node *node = llist_del_all(&ctx->work_llist);
 
 	__io_fallback_tw(node, false);
+	node = llist_del_all(&ctx->retry_llist);
+	__io_fallback_tw(node, false);
 }
 
 static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
@@ -1269,37 +1272,55 @@ static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
 	return false;
 }
 
+static int __io_run_local_work_loop(struct llist_node **node,
+				    struct io_tw_state *ts,
+				    int events)
+{
+	while (*node) {
+		struct llist_node *next = (*node)->next;
+		struct io_kiocb *req = container_of(*node, struct io_kiocb,
+						    io_task_work.node);
+		INDIRECT_CALL_2(req->io_task_work.func,
+				io_poll_task_func, io_req_rw_complete,
+				req, ts);
+		*node = next;
+		if (--events <= 0)
+			break;
+	}
+
+	return events;
+}
+
 static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts,
 			       int min_events)
 {
 	struct llist_node *node;
 	unsigned int loops = 0;
-	int ret = 0;
+	int ret, limit;
 
 	if (WARN_ON_ONCE(ctx->submitter_task != current))
 		return -EEXIST;
 	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
 		atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
+	limit = max(IO_LOCAL_TW_DEFAULT_MAX, min_events);
 again:
+	ret = __io_run_local_work_loop(&ctx->retry_llist.first, ts, limit);
+	if (ctx->retry_llist.first)
+		goto retry_done;
+
 	/*
 	 * llists are in reverse order, flip it back the right way before
 	 * running the pending items.
 	 */
 	node = llist_reverse_order(llist_del_all(&ctx->work_llist));
-	while (node) {
-		struct llist_node *next = node->next;
-		struct io_kiocb *req = container_of(node, struct io_kiocb,
-						    io_task_work.node);
-		INDIRECT_CALL_2(req->io_task_work.func,
-				io_poll_task_func, io_req_rw_complete,
-				req, ts);
-		ret++;
-		node = next;
-	}
+	ret = __io_run_local_work_loop(&node, ts, ret);
+	ctx->retry_llist.first = node;
 	loops++;
 
+	ret = limit - ret;
 	if (io_run_local_work_continue(ctx, ret, min_events))
 		goto again;
+retry_done:
 	io_submit_flush_completions(ctx);
 	if (io_run_local_work_continue(ctx, ret, min_events))
 		goto again;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 69eb3b23a5a0..12abee607e4a 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -349,7 +349,7 @@ static inline int io_run_task_work(void)
 
 static inline bool io_local_work_pending(struct io_ring_ctx *ctx)
 {
-	return !llist_empty(&ctx->work_llist);
+	return !llist_empty(&ctx->work_llist) || !llist_empty(&ctx->retry_llist);
 }
 
 static inline bool io_task_work_pending(struct io_ring_ctx *ctx)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 1/2] io_uring: add io_local_work_pending()
  2024-11-20 22:14 ` [PATCH next v1 1/2] io_uring: add io_local_work_pending() David Wei
@ 2024-11-20 23:45   ` Pavel Begunkov
  0 siblings, 0 replies; 30+ messages in thread
From: Pavel Begunkov @ 2024-11-20 23:45 UTC (permalink / raw)
  To: David Wei, io-uring; +Cc: Jens Axboe

On 11/20/24 22:14, David Wei wrote:
> In preparation for adding a new llist of tw to retry due to hitting the
> tw limit, add a helper io_local_work_pending(). This function returns
> true if there is any local tw pending. For now it only checks
> ctx->work_llist.

Looks clean, we can even take it separately from 2/2

Reviewed-by: Pavel Begunkov <[email protected]>

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-20 22:14 ` [PATCH next v1 2/2] io_uring: limit local tw done David Wei
@ 2024-11-20 23:56   ` Pavel Begunkov
  2024-11-21  0:52     ` David Wei
  2024-11-21  1:12     ` Jens Axboe
  0 siblings, 2 replies; 30+ messages in thread
From: Pavel Begunkov @ 2024-11-20 23:56 UTC (permalink / raw)
  To: David Wei, io-uring; +Cc: Jens Axboe

On 11/20/24 22:14, David Wei wrote:
> Instead of eagerly running all available local tw, limit the amount of
> local tw done to the max of IO_LOCAL_TW_DEFAULT_MAX (20) or wait_nr. The
> value of 20 is chosen as a reasonable heuristic to allow enough work
> batching but also keep latency down.
> 
> Add a retry_llist that maintains a list of local tw that couldn't be
> done in time. No synchronisation is needed since it is only modified
> within the task context.
> 
> Signed-off-by: David Wei <[email protected]>
> ---
>   include/linux/io_uring_types.h |  1 +
>   io_uring/io_uring.c            | 43 +++++++++++++++++++++++++---------
>   io_uring/io_uring.h            |  2 +-
>   3 files changed, 34 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 593c10a02144..011860ade268 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -336,6 +336,7 @@ struct io_ring_ctx {
>   	 */
>   	struct {
>   		struct llist_head	work_llist;
> +		struct llist_head	retry_llist;

Fwiw, probably doesn't matter, but it doesn't even need
to be atomic, it's queued and spliced while holding
->uring_lock, the pending check is also synchronised as
there is only one possible task doing that.

>   		unsigned long		check_cq;
>   		atomic_t		cq_wait_nr;
>   		atomic_t		cq_timeouts;
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 83bf041d2648..c3a7d0197636 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -121,6 +121,7 @@
...
>   static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts,
>   			       int min_events)
>   {
>   	struct llist_node *node;
>   	unsigned int loops = 0;
> -	int ret = 0;
> +	int ret, limit;
>   
>   	if (WARN_ON_ONCE(ctx->submitter_task != current))
>   		return -EEXIST;
>   	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
>   		atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
> +	limit = max(IO_LOCAL_TW_DEFAULT_MAX, min_events);
>   again:
> +	ret = __io_run_local_work_loop(&ctx->retry_llist.first, ts, limit);
> +	if (ctx->retry_llist.first)
> +		goto retry_done;
> +
>   	/*
>   	 * llists are in reverse order, flip it back the right way before
>   	 * running the pending items.
>   	 */
>   	node = llist_reverse_order(llist_del_all(&ctx->work_llist));
> -	while (node) {
> -		struct llist_node *next = node->next;
> -		struct io_kiocb *req = container_of(node, struct io_kiocb,
> -						    io_task_work.node);
> -		INDIRECT_CALL_2(req->io_task_work.func,
> -				io_poll_task_func, io_req_rw_complete,
> -				req, ts);
> -		ret++;
> -		node = next;
> -	}
> +	ret = __io_run_local_work_loop(&node, ts, ret);

One thing that is not so nice is that now we have this handling and
checks in the hot path, and __io_run_local_work_loop() most likely
gets uninlined.

I wonder, can we just requeue it via task_work again? We can even
add a variant efficiently adding a list instead of a single entry,
i.e. local_task_work_add(head, tail, ...);

I'm also curious what's the use case you've got that is hitting
the problem?

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-20 23:56   ` Pavel Begunkov
@ 2024-11-21  0:52     ` David Wei
  2024-11-21 14:29       ` Pavel Begunkov
  2024-11-21  1:12     ` Jens Axboe
  1 sibling, 1 reply; 30+ messages in thread
From: David Wei @ 2024-11-21  0:52 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Jens Axboe

On 2024-11-20 15:56, Pavel Begunkov wrote:
> On 11/20/24 22:14, David Wei wrote:
>> Instead of eagerly running all available local tw, limit the amount of
>> local tw done to the max of IO_LOCAL_TW_DEFAULT_MAX (20) or wait_nr. The
>> value of 20 is chosen as a reasonable heuristic to allow enough work
>> batching but also keep latency down.
>>
>> Add a retry_llist that maintains a list of local tw that couldn't be
>> done in time. No synchronisation is needed since it is only modified
>> within the task context.
>>
>> Signed-off-by: David Wei <[email protected]>
>> ---
>>   include/linux/io_uring_types.h |  1 +
>>   io_uring/io_uring.c            | 43 +++++++++++++++++++++++++---------
>>   io_uring/io_uring.h            |  2 +-
>>   3 files changed, 34 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index 593c10a02144..011860ade268 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -336,6 +336,7 @@ struct io_ring_ctx {
>>        */
>>       struct {
>>           struct llist_head    work_llist;
>> +        struct llist_head    retry_llist;
> 
> Fwiw, probably doesn't matter, but it doesn't even need
> to be atomic, it's queued and spliced while holding
> ->uring_lock, the pending check is also synchronised as
> there is only one possible task doing that.

Yeah, it doesn't. Keeping it as a llist_head means being able to re-use
helpers that take llist_head or llist_node.

> 
>>           unsigned long        check_cq;
>>           atomic_t        cq_wait_nr;
>>           atomic_t        cq_timeouts;
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 83bf041d2648..c3a7d0197636 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -121,6 +121,7 @@
> ...
>>   static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts,
>>                      int min_events)
>>   {
>>       struct llist_node *node;
>>       unsigned int loops = 0;
>> -    int ret = 0;
>> +    int ret, limit;
>>         if (WARN_ON_ONCE(ctx->submitter_task != current))
>>           return -EEXIST;
>>       if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
>>           atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
>> +    limit = max(IO_LOCAL_TW_DEFAULT_MAX, min_events);
>>   again:
>> +    ret = __io_run_local_work_loop(&ctx->retry_llist.first, ts, limit);
>> +    if (ctx->retry_llist.first)
>> +        goto retry_done;
>> +
>>       /*
>>        * llists are in reverse order, flip it back the right way before
>>        * running the pending items.
>>        */
>>       node = llist_reverse_order(llist_del_all(&ctx->work_llist));
>> -    while (node) {
>> -        struct llist_node *next = node->next;
>> -        struct io_kiocb *req = container_of(node, struct io_kiocb,
>> -                            io_task_work.node);
>> -        INDIRECT_CALL_2(req->io_task_work.func,
>> -                io_poll_task_func, io_req_rw_complete,
>> -                req, ts);
>> -        ret++;
>> -        node = next;
>> -    }
>> +    ret = __io_run_local_work_loop(&node, ts, ret);
> 
> One thing that is not so nice is that now we have this handling and
> checks in the hot path, and __io_run_local_work_loop() most likely
> gets uninlined.
> 
> I wonder, can we just requeue it via task_work again? We can even
> add a variant efficiently adding a list instead of a single entry,
> i.e. local_task_work_add(head, tail, ...);

That was an early idea, but it means re-reversing the list and then
atomically adding each node back to work_llist concurrently with e.g.
io_req_local_work_add().

Using a separate retry_llist means we don't need to concurrently add to
either retry_llist or work_llist.

> 
> I'm also curious what's the use case you've got that is hitting
> the problem?
> 

There is a Memcache-like workload that has load shedding based on the
time spent doing work. With epoll, the work of reading sockets and
processing a request is done by user, which can decide after some amount
of time to drop the remaining work if it takes too long. With io_uring,
the work of reading sockets is done eagerly inside of task work. If
there is a burst of work, then so much time is spent in task work
reading from sockets that, by the time control returns to user the
timeout has already elapsed.
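
A rough sketch of that epoll-style pattern, purely illustrative
(handle_request() and the time budget are made-up names/values):

#include <sys/epoll.h>
#include <time.h>

extern void handle_request(int fd);	/* illustrative: read + process */

static long elapsed_ns(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1000000000L +
	       (b->tv_nsec - a->tv_nsec);
}

/* Drain ready sockets, but shed whatever is left once the budget is spent */
static void drain_ready(int epfd, long budget_ns)
{
	struct epoll_event evs[64];
	struct timespec start, now;
	int i, n;

	n = epoll_wait(epfd, evs, 64, -1);
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < n; i++) {
		handle_request(evs[i].data.fd);
		clock_gettime(CLOCK_MONOTONIC, &now);
		if (elapsed_ns(&start, &now) > budget_ns)
			break;
	}
}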

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-20 23:56   ` Pavel Begunkov
  2024-11-21  0:52     ` David Wei
@ 2024-11-21  1:12     ` Jens Axboe
  2024-11-21 14:25       ` Pavel Begunkov
  1 sibling, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2024-11-21  1:12 UTC (permalink / raw)
  To: Pavel Begunkov, David Wei, io-uring

On 11/20/24 4:56 PM, Pavel Begunkov wrote:
> On 11/20/24 22:14, David Wei wrote:
>> Instead of eagerly running all available local tw, limit the amount of
>> local tw done to the max of IO_LOCAL_TW_DEFAULT_MAX (20) or wait_nr. The
>> value of 20 is chosen as a reasonable heuristic to allow enough work
>> batching but also keep latency down.
>>
>> Add a retry_llist that maintains a list of local tw that couldn't be
>> done in time. No synchronisation is needed since it is only modified
>> within the task context.
>>
>> Signed-off-by: David Wei <[email protected]>
>> ---
>>   include/linux/io_uring_types.h |  1 +
>>   io_uring/io_uring.c            | 43 +++++++++++++++++++++++++---------
>>   io_uring/io_uring.h            |  2 +-
>>   3 files changed, 34 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index 593c10a02144..011860ade268 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -336,6 +336,7 @@ struct io_ring_ctx {
>>        */
>>       struct {
>>           struct llist_head    work_llist;
>> +        struct llist_head    retry_llist;
> 
> Fwiw, probably doesn't matter, but it doesn't even need
> to be atomic, it's queued and spliced while holding
> ->uring_lock, the pending check is also synchronised as
> there is only one possible task doing that.
> 
>>           unsigned long        check_cq;
>>           atomic_t        cq_wait_nr;
>>           atomic_t        cq_timeouts;
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 83bf041d2648..c3a7d0197636 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -121,6 +121,7 @@
> ...
>>   static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts,
>>                      int min_events)
>>   {
>>       struct llist_node *node;
>>       unsigned int loops = 0;
>> -    int ret = 0;
>> +    int ret, limit;
>>         if (WARN_ON_ONCE(ctx->submitter_task != current))
>>           return -EEXIST;
>>       if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
>>           atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
>> +    limit = max(IO_LOCAL_TW_DEFAULT_MAX, min_events);
>>   again:
>> +    ret = __io_run_local_work_loop(&ctx->retry_llist.first, ts, limit);
>> +    if (ctx->retry_llist.first)
>> +        goto retry_done;
>> +
>>       /*
>>        * llists are in reverse order, flip it back the right way before
>>        * running the pending items.
>>        */
>>       node = llist_reverse_order(llist_del_all(&ctx->work_llist));
>> -    while (node) {
>> -        struct llist_node *next = node->next;
>> -        struct io_kiocb *req = container_of(node, struct io_kiocb,
>> -                            io_task_work.node);
>> -        INDIRECT_CALL_2(req->io_task_work.func,
>> -                io_poll_task_func, io_req_rw_complete,
>> -                req, ts);
>> -        ret++;
>> -        node = next;
>> -    }
>> +    ret = __io_run_local_work_loop(&node, ts, ret);
> 
> One thing that is not so nice is that now we have this handling and
> checks in the hot path, and __io_run_local_work_loop() most likely
> gets uninlined.

I don't think that really matters, it's pretty light. The main overhead
in this function is not the call, it's reordering requests and touching
cachelines of the requests.

I think it's pretty light as-is and actually looks pretty good. It's
also similar to how sqpoll bites over longer task_work lines, and
arguably a mistake that we allow huge depths of this when we can avoid
it with deferred task_work.

> I wonder, can we just requeue it via task_work again? We can even
> add a variant efficiently adding a list instead of a single entry,
> i.e. local_task_work_add(head, tail, ...);

I think that can only work if we change work_llist to be a regular list
with regular locking. Otherwise it's a bit of a mess with the list being
reordered, and then you're spending extra cycles on potentially
reordering all the entries again.

> I'm also curious what's the use case you've got that is hitting
> the problem?

I'll let David answer that one, but some task_work can take a while to
run, eg if it's not just posting a completion.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 0/2] limit local tw done
  2024-11-20 22:14 [PATCH next v1 0/2] limit local tw done David Wei
  2024-11-20 22:14 ` [PATCH next v1 1/2] io_uring: add io_local_work_pending() David Wei
  2024-11-20 22:14 ` [PATCH next v1 2/2] io_uring: limit local tw done David Wei
@ 2024-11-21  1:12 ` Jens Axboe
  2024-11-21 14:16 ` Jens Axboe
  3 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2024-11-21  1:12 UTC (permalink / raw)
  To: David Wei, io-uring; +Cc: Pavel Begunkov

On 11/20/24 3:14 PM, David Wei wrote:
> Currently when running local tw it will eagerly try and complete all
> available work. When there is a burst of work coming in, the list of
> work in work_llist may be much larger than the requested batch count
> wait_nr. Doing all of the work eagerly may cause latency issues for some
> applications that do not want to spend so much time in the kernel.
> 
> Limit the amount of local tw done to the max of 20 (a heuristic) or
> wait_nr. This then does not require any userspace changes.
> 
> Many applications will submit and wait with wait_nr = 1 to mean "wait
> for _at least_ 1 completion, but do more work if available". This used
> to mean "all work" but now that is capped to 20 requests. If users want
> more work batching, then they can set wait_nr to > 20.

Looks good to me!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 0/2] limit local tw done
  2024-11-20 22:14 [PATCH next v1 0/2] limit local tw done David Wei
                   ` (2 preceding siblings ...)
  2024-11-21  1:12 ` [PATCH next v1 0/2] " Jens Axboe
@ 2024-11-21 14:16 ` Jens Axboe
  3 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2024-11-21 14:16 UTC (permalink / raw)
  To: io-uring, David Wei; +Cc: Pavel Begunkov


On Wed, 20 Nov 2024 14:14:50 -0800, David Wei wrote:
> Currently when running local tw it will eagerly try and complete all
> available work. When there is a burst of work coming in, the list of
> work in work_llist may be much larger than the requested batch count
> wait_nr. Doing all of the work eagerly may cause latency issues for some
> applications that do not want to spend so much time in the kernel.
> 
> Limit the amount of local tw done to the max of 20 (a heuristic) or
> wait_nr. This then does not require any userspace changes.
> 
> [...]

Applied, thanks!

[1/2] io_uring: add io_local_work_pending()
      commit: 40cfe553240b32333b42652370ef5232e6ac59e1
[2/2] io_uring: limit local tw done
      commit: f46b9cdb22f7a167c36b6bcddaef7e8aee2598fa

Best regards,
-- 
Jens Axboe




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21  1:12     ` Jens Axboe
@ 2024-11-21 14:25       ` Pavel Begunkov
  2024-11-21 14:31         ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Pavel Begunkov @ 2024-11-21 14:25 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring

On 11/21/24 01:12, Jens Axboe wrote:
> On 11/20/24 4:56 PM, Pavel Begunkov wrote:
>> On 11/20/24 22:14, David Wei wrote:
...
>> One thing that is not so nice is that now we have this handling and
>> checks in the hot path, and __io_run_local_work_loop() most likely
>> gets uninlined.
> 
> I don't think that really matters, it's pretty light. The main overhead
> in this function is not the call, it's reordering requests and touching
> cachelines of the requests.
> 
> I think it's pretty light as-is and actually looks pretty good. It's

It could be light, but the question is importance / frequency of
the new path. If it only happens rarely but affects a high 9,
then it'd make more sense to optimise it out of the common path.

> also similar to how sqpoll bites over longer task_work lines, and
> arguably a mistake that we allow huge depths of this when we can avoid
> it with deferred task_work.
> 
>> I wonder, can we just requeue it via task_work again? We can even
>> add a variant efficiently adding a list instead of a single entry,
>> i.e. local_task_work_add(head, tail, ...);
> 
> I think that can only work if we change work_llist to be a regular list
> with regular locking. Otherwise it's a bit of a mess with the list being

Dylan once measured the overhead of locks vs atomics in this
path for some artificial case, we can pull the numbers up.

> reordered, and then you're spending extra cycles on potentially
> reordering all the entries again.

That sucks, I agree, but then it's the same question of how often
it happens.


-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21  0:52     ` David Wei
@ 2024-11-21 14:29       ` Pavel Begunkov
  2024-11-21 14:34         ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Pavel Begunkov @ 2024-11-21 14:29 UTC (permalink / raw)
  To: David Wei, io-uring; +Cc: Jens Axboe

On 11/21/24 00:52, David Wei wrote:
> On 2024-11-20 15:56, Pavel Begunkov wrote:
>> On 11/20/24 22:14, David Wei wrote:
...
>> One thing that is not so nice is that now we have this handling and
>> checks in the hot path, and __io_run_local_work_loop() most likely
>> gets uninlined.
>>
>> I wonder, can we just requeue it via task_work again? We can even
>> add a variant efficiently adding a list instead of a single entry,
>> i.e. local_task_work_add(head, tail, ...);
> 
> That was an early idea, but it means re-reversing the list and then
> atomically adding each node back to work_llist concurrently with e.g.
> io_req_local_work_add().
> 
> Using a separate retry_llist means we don't need to concurrently add to
> either retry_llist or work_llist.
> 
>>
>> I'm also curious what's the use case you've got that is hitting
>> the problem?
>>
> 
> There is a Memcache-like workload that has load shedding based on the
> time spent doing work. With epoll, the work of reading sockets and
> processing a request is done by user, which can decide after some amount
> of time to drop the remaining work if it takes too long. With io_uring,
> the work of reading sockets is done eagerly inside of task work. If
> there is a burst of work, then so much time is spent in task work
> reading from sockets that, by the time control returns to user the
> timeout has already elapsed.

Interesting, it also sounds like instead of an arbitrary 20 we
might want the user to feed it to us. Might be easier to do it
with the bpf toy not to carve another argument.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 14:25       ` Pavel Begunkov
@ 2024-11-21 14:31         ` Jens Axboe
  2024-11-21 15:07           ` Pavel Begunkov
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2024-11-21 14:31 UTC (permalink / raw)
  To: Pavel Begunkov, David Wei, io-uring

On 11/21/24 7:25 AM, Pavel Begunkov wrote:
> On 11/21/24 01:12, Jens Axboe wrote:
>> On 11/20/24 4:56 PM, Pavel Begunkov wrote:
>>> On 11/20/24 22:14, David Wei wrote:
> ...
>>> One thing that is not so nice is that now we have this handling and
>>> checks in the hot path, and __io_run_local_work_loop() most likely
>>> gets uninlined.
>>
>> I don't think that really matters, it's pretty light. The main overhead
>> in this function is not the call, it's reordering requests and touching
>> cachelines of the requests.
>>
>> I think it's pretty light as-is and actually looks pretty good. It's
> 
> It could be light, but the question is importance / frequency of
> the new path. If it only happens rarely but affects a high 9,
> then it'd more sense to optimise it from the common path.

I'm more worried about the outlier cases. We don't generally expect this
to trigger very much, obviously; if long chains of task_work were the norm
then we'd have other reports/issues related to that. But the common
overhead here is really just checking if another (same cacheline)
pointer is non-NULL, and ditto on the run side. Really don't think
that's anything to worry about.

>> also similar to how sqpoll bites over longer task_work lines, and
>> arguably a mistake that we allow huge depths of this when we can avoid
>> it with deferred task_work.
>>
>>> I wonder, can we just requeue it via task_work again? We can even
>>> add a variant efficiently adding a list instead of a single entry,
>>> i.e. local_task_work_add(head, tail, ...);
>>
>> I think that can only work if we change work_llist to be a regular list
>> with regular locking. Otherwise it's a bit of a mess with the list being
> 
> Dylan once measured the overhead of locks vs atomics in this
> path for some artificial case, we can pull the numbers up.

I did it more recently if you'll remember, actually posted a patch I
think a few months ago changing it to that. But even that approach adds
extra overhead, if you want to add it to the same list as now you need
to re-grab (and re-disable interrupts) the lock to add it back. My gut
says that would be _worse_ than the current approach. And if you keep a
separate list instead, well then you're back to identical overhead in
terms of now needing to check both when needing to know if anything is
pending, and checking both when running it.

>> reordered, and then you're spending extra cycles on potentially
>> reordering all the entries again.
> 
> That sucks, I agree, but then it's the same question of how often
> it happens.

At least for now, there's a real issue reported and we should fix it. I
think the current patches are fine in that regard. That doesn't mean we
can't potentially make it better, we should certainly investigate that.
But I don't see the current patches as being suboptimal really, they are
definitely good enough as-is for solving the issue.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 14:29       ` Pavel Begunkov
@ 2024-11-21 14:34         ` Jens Axboe
  2024-11-21 14:58           ` Pavel Begunkov
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2024-11-21 14:34 UTC (permalink / raw)
  To: Pavel Begunkov, David Wei, io-uring

On 11/21/24 7:29 AM, Pavel Begunkov wrote:
> On 11/21/24 00:52, David Wei wrote:
>> On 2024-11-20 15:56, Pavel Begunkov wrote:
>>> On 11/20/24 22:14, David Wei wrote:
> ...
>>> One thing that is not so nice is that now we have this handling and
>>> checks in the hot path, and __io_run_local_work_loop() most likely
>>> gets uninlined.
>>>
>>> I wonder, can we just requeue it via task_work again? We can even
>>> add a variant efficiently adding a list instead of a single entry,
>>> i.e. local_task_work_add(head, tail, ...);
>>
>> That was an early idea, but it means re-reversing the list and then
>> atomically adding each node back to work_llist concurrently with e.g.
>> io_req_local_work_add().
>>
>> Using a separate retry_llist means we don't need to concurrently add to
>> either retry_llist or work_llist.
>>
>>>
>>> I'm also curious what's the use case you've got that is hitting
>>> the problem?
>>>
>>
>> There is a Memcache-like workload that has load shedding based on the
>> time spent doing work. With epoll, the work of reading sockets and
>> processing a request is done by user, which can decide after some amount
>> of time to drop the remaining work if it takes too long. With io_uring,
>> the work of reading sockets is done eagerly inside of task work. If
>> there is a burst of work, then so much time is spent in task work
>> reading from sockets that, by the time control returns to user the
>> timeout has already elapsed.
> 
> Interesting, it also sounds like instead of an arbitrary 20 we
> might want the user to feed it to us. Might be easier to do it
> with the bpf toy not to carve another argument.

David and I did discuss that, and I was not in favor of having an extra
argument. We really just need some kind of limit to prevent it
over-running. Arguably that should always be min_events, which we
already have, but that kind of runs afoul of applications just doing
io_uring_wait_cqe() and hence asking for 1. That's why the hand wavy
number exists, which is really no different than other hand wavy numbers
we have to limit running of "something" - eg other kinds of retries.

Adding another argument to this just again doubles wait logic complexity
in terms of the API. If it's needed down the line for whatever reason,
then yeah we can certainly do it, probably via the wait regions. But
adding it to the generic wait path would be a mistake imho.

I also strongly suspect this is the last we'll ever hear of this, and
for that reason alone I don't think it's worth any kind of extra
arguments or added complexity.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 14:34         ` Jens Axboe
@ 2024-11-21 14:58           ` Pavel Begunkov
  2024-11-21 15:02             ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Pavel Begunkov @ 2024-11-21 14:58 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring

On 11/21/24 14:34, Jens Axboe wrote:
> On 11/21/24 7:29 AM, Pavel Begunkov wrote:
>> On 11/21/24 00:52, David Wei wrote:
>>> On 2024-11-20 15:56, Pavel Begunkov wrote:
>>>> On 11/20/24 22:14, David Wei wrote:
...
>>> There is a Memcache-like workload that has load shedding based on the
>>> time spent doing work. With epoll, the work of reading sockets and
>>> processing a request is done by user, which can decide after some amount
>>> of time to drop the remaining work if it takes too long. With io_uring,
>>> the work of reading sockets is done eagerly inside of task work. If
>>> there is a burst of work, then so much time is spent in task work
>>> reading from sockets that, by the time control returns to user the
>>> timeout has already elapsed.
>>
>> Interesting, it also sounds like instead of an arbitrary 20 we
>> might want the user to feed it to us. Might be easier to do it
>> with the bpf toy not to carve another argument.
> 
> David and I did discuss that, and I was not in favor of having an extra
> argument. We really just need some kind of limit to prevent it
> over-running. Arguably that should always be min_events, which we
> already have, but that kind of runs afoul of applications just doing
> io_uring_wait_cqe() and hence asking for 1. That's why the hand wavy
> number exists, which is really no different than other hand wavy numbers
> we have to limit running of "something" - eg other kinds of retries.
> 
> Adding another argument to this just again doubles wait logic complexity
> in terms of the API. If it's needed down the line for whatever reason,
> then yeah we can certainly do it, probably via the wait regions. But
> adding it to the generic wait path would be a mistake imho.

Right, I don't like the idea of a wait argument either, messy and
too advanced of a tuning for it. BPF would be fine as it could be
made as a hint and easily removed if needed. It could also be a per
ring REGISTER based hint, a bit worse but with the same deprecation
argument. Obviously would need a good user / use case first.

> I also strongly suspect this is the last we'll ever hear of this, and
> for that reason alone I don't think it's worth any kind of extra
> arguments or added complexity.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 14:58           ` Pavel Begunkov
@ 2024-11-21 15:02             ` Jens Axboe
  0 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2024-11-21 15:02 UTC (permalink / raw)
  To: Pavel Begunkov, David Wei, io-uring

On 11/21/24 7:58 AM, Pavel Begunkov wrote:
> On 11/21/24 14:34, Jens Axboe wrote:
>> On 11/21/24 7:29 AM, Pavel Begunkov wrote:
>>> On 11/21/24 00:52, David Wei wrote:
>>>> On 2024-11-20 15:56, Pavel Begunkov wrote:
>>>>> On 11/20/24 22:14, David Wei wrote:
> ...
>>>> There is a Memcache-like workload that has load shedding based on the
>>>> time spent doing work. With epoll, the work of reading sockets and
>>>> processing a request is done by user, which can decide after some amount
>>>> of time to drop the remaining work if it takes too long. With io_uring,
>>>> the work of reading sockets is done eagerly inside of task work. If
>>>> there is a burst of work, then so much time is spent in task work
>>>> reading from sockets that, by the time control returns to user the
>>>> timeout has already elapsed.
>>>
>>> Interesting, it also sounds like instead of an arbitrary 20 we
>>> might want the user to feed it to us. Might be easier to do it
>>> with the bpf toy not to carve another argument.
>>
>> David and I did discuss that, and I was not in favor of having an extra
>> argument. We really just need some kind of limit to prevent it
>> over-running. Arguably that should always be min_events, which we
>> already have, but that kind of runs afoul of applications just doing
>> io_uring_wait_cqe() and hence asking for 1. That's why the hand wavy
>> number exists, which is really no different than other hand wavy numbers
>> we have to limit running of "something" - eg other kinds of retries.
>>
>> Adding another argument to this just again doubles wait logic complexity
>> in terms of the API. If it's needed down the line for whatever reason,
>> then yeah we can certainly do it, probably via the wait regions. But
>> adding it to the generic wait path would be a mistake imho.
> 
> Right, I don't like the idea of a wait argument either, messy and
> too advanced of a tuning for it. BPF would be fine as it could be
> made as a hint and easily removed if needed. It could also be a per
> ring REGISTER based hint, a bit worse but with the same deprecation
> argument. Obviously would need a good user / use case first.

Sure I agree on that, if you already have BPF integration for other
things, then yeah you could add tighter control for it. Even with that,
I doubt it'd be useful or meaningful really. Current case seems like a
worst case kind of thing, recv task_work is arguably one of the more
expensive things that can be done out of task_work.

We can revisit down the line if it ever becomes needed.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 14:31         ` Jens Axboe
@ 2024-11-21 15:07           ` Pavel Begunkov
  2024-11-21 15:15             ` Jens Axboe
  2024-11-21 17:53             ` David Wei
  0 siblings, 2 replies; 30+ messages in thread
From: Pavel Begunkov @ 2024-11-21 15:07 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring

On 11/21/24 14:31, Jens Axboe wrote:
> On 11/21/24 7:25 AM, Pavel Begunkov wrote:
>> On 11/21/24 01:12, Jens Axboe wrote:
>>> On 11/20/24 4:56 PM, Pavel Begunkov wrote:
>>>> On 11/20/24 22:14, David Wei wrote:
...
>>> I think that can only work if we change work_llist to be a regular list
>>> with regular locking. Otherwise it's a bit of a mess with the list being
>>
>> Dylan once measured the overhead of locks vs atomics in this
>> path for some artificial case, we can pull the numbers up.
> 
> I did it more recently if you'll remember, actually posted a patch I
> think a few months ago changing it to that. But even that approach adds

Right, and it'd be a separate topic from this set.

> extra overhead, if you want to add it to the same list as now you need

Extra overhead to the retry path, which is not the hot path,
and coldness of it is uncertain.

> to re-grab (and re-disable interrupts) the lock to add it back. My gut
> says that would be _worse_ than the current approach. And if you keep a
> separate list instead, well then you're back to identical overhead in
> terms of now needing to check both when needing to know if anything is
> pending, and checking both when running it.
> 
>>> reordered, and then you're spending extra cycles on potentially
>>> reordering all the entries again.
>>
>> That sucks, I agree, but then it's same question of how often
>> it happens.
> 
> At least for now, there's a real issue reported and we should fix it. I
> think the current patches are fine in that regard. That doesn't mean we
> can't potentially make it better, we should certainly investigate that.
> But I don't see the current patches as being suboptimal really, they are
> definitely good enough as-is for solving the issue.

That's fair enough, but I still would love to know how frequent
it is. There is no purpose in optimising it as hot/slow path if
it triggers every fifth run or such. David, how easy is it to
get some stats? We can hack up some bpftrace script

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 15:07           ` Pavel Begunkov
@ 2024-11-21 15:15             ` Jens Axboe
  2024-11-21 15:22               ` Jens Axboe
  2024-11-21 17:53             ` David Wei
  1 sibling, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2024-11-21 15:15 UTC (permalink / raw)
  To: Pavel Begunkov, David Wei, io-uring

On 11/21/24 8:07 AM, Pavel Begunkov wrote:
>> At least for now, there's a real issue reported and we should fix it. I
>> think the current patches are fine in that regard. That doesn't mean we
>> can't potentially make it better, we should certainly investigate that.
>> But I don't see the current patches as being suboptimal really, they are
>> definitely good enough as-is for solving the issue.
> 
> That's fair enough, but I still would love to know how frequent
> it is. There is no purpose in optimising it as hot/slow path if
> it triggers every fifth run or such. David, how easy is it to
> get some stats? We can hack up some bpftrace script

As mentioned, I don't think it's a frequent occurrence in the sense that
it happens all the time, but it's one of those "if it hits, we're
screwed" cases as now we're way over budget. And we have zero limiting
in place to prevent it from happening. As such it doesn't really matter
how frequent it is, all that matters is that it cannot happen.

In terms of a different approach, eg "we can tolerate a bit more
overhead for the overrun case happening", then yeah that'd be
interesting as it may guide improvements. But the frequency of it
occurring for one case doesn't mean that it won't be a "hits 50% of the
time" for some other case. Living with a separate retry list (which is
lockless, after all) seems to me to be the least of all evils, as the
overhead is both limited and constant in all cases.

I'd rather entertain NOT using llists for this in the first place, as it
gets rid of the reversing which is the main cost here. That won't change
the need for a retry list necessarily, as I think we'd be better off
with a lockless retry list still. But at least it'd get rid of the
reversing. Let me see if I can dig out that patch... Totally orthogonal
to this topic, obviously.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 15:15             ` Jens Axboe
@ 2024-11-21 15:22               ` Jens Axboe
  2024-11-21 16:00                 ` Pavel Begunkov
  2024-11-21 16:18                 ` Pavel Begunkov
  0 siblings, 2 replies; 30+ messages in thread
From: Jens Axboe @ 2024-11-21 15:22 UTC (permalink / raw)
  To: Pavel Begunkov, David Wei, io-uring

On 11/21/24 8:15 AM, Jens Axboe wrote:
> I'd rather entertain NOT using llists for this in the first place, as it
> gets rid of the reversing which is the main cost here. That won't change
> the need for a retry list necessarily, as I think we'd be better off
> with a lockless retry list still. But at least it'd get rid of the
> reversing. Let me see if I can dig out that patch... Totally orthogonal
> to this topic, obviously.

It's here:

https://lore.kernel.org/io-uring/[email protected]/

I did improve it further but never posted it again, fwiw.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 15:22               ` Jens Axboe
@ 2024-11-21 16:00                 ` Pavel Begunkov
  2024-11-21 16:05                   ` Jens Axboe
  2024-11-21 16:18                 ` Pavel Begunkov
  1 sibling, 1 reply; 30+ messages in thread
From: Pavel Begunkov @ 2024-11-21 16:00 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring

On 11/21/24 15:22, Jens Axboe wrote:
> On 11/21/24 8:15 AM, Jens Axboe wrote:
>> I'd rather entertain NOT using llists for this in the first place, as it
>> gets rid of the reversing which is the main cost here. That won't change
>> the need for a retry list necessarily, as I think we'd be better off
>> with a lockless retry list still. But at least it'd get rid of the
>> reversing. Let me see if I can dig out that patch... Totally orthogonal
>> to this topic, obviously.
> 
> It's here:
> 
> https://lore.kernel.org/io-uring/[email protected]/
> 
> I did improve it further but never posted it again, fwiw.

It's nice that with sth like that we're not restricted by space and can be
smarter about batching, e.g. splitting nr_tw into buckets. However, the
overhead of the spinlock could really hurt if there is contention. With
block it's more uniform which CPU tw comes from, but with network it
could be much more random. That's what Dylan measured back then, and
quite a similar situation that you've seen yourself before is with
socket locks.

Another option is to try out how a lockless list (instead of stack)
with double cmpxchg would perform.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 16:00                 ` Pavel Begunkov
@ 2024-11-21 16:05                   ` Jens Axboe
  0 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2024-11-21 16:05 UTC (permalink / raw)
  To: Pavel Begunkov, David Wei, io-uring

On 11/21/24 9:00 AM, Pavel Begunkov wrote:
> On 11/21/24 15:22, Jens Axboe wrote:
>> On 11/21/24 8:15 AM, Jens Axboe wrote:
>>> I'd rather entertain NOT using llists for this in the first place, as it
>>> gets rid of the reversing which is the main cost here. That won't change
>>> the need for a retry list necessarily, as I think we'd be better off
>>> with a lockless retry list still. But at least it'd get rid of the
>>> reversing. Let me see if I can dig out that patch... Totally orthogonal
>>> to this topic, obviously.
>>
>> It's here:
>>
>> https://lore.kernel.org/io-uring/[email protected]/
>>
>> I did improve it further but never posted it again, fwiw.
> It's nice that with sth like that we're not restricted by space and can be
> smarter about batching, e.g. splitting nr_tw into buckets. However, the
> overhead of the spinlock could really hurt if there is contention. With

This is true, but it's also equally true for the llist - if you have
contention on adding vs running, then you'll be bouncing the cacheline
regardless. With the spinlock, you also have the added overhead of the
IRQ disabling.

> block it's more uniform which CPU tw comes from, but with network it
> could be much more random. That's what Dylan measured back then, and
> quite a similar situation that you've seen yourself before is with
> socket locks.

I'm sure he found the llist to be preferable, but that was also before
we had to add the reversing. So not so clear cut anymore, may push it
over the edge as I bet there was not much of a difference before. At
least when I benchmarked this back in March, it wasn't like llist was
a clear winner.

> Another option is to try out how a lockless list (instead of stack)
> with double cmpxchg would perform.

I did ponder that and even talked to Paul about it as well, but I never
benchmarked that one. I can try and resurrect that effort. One annoyance
there is that we need arch support, but I don't think it's a big deal as
the alternative is just to fallback to spinlock + normal list if it's
not available.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 15:22               ` Jens Axboe
  2024-11-21 16:00                 ` Pavel Begunkov
@ 2024-11-21 16:18                 ` Pavel Begunkov
  2024-11-21 16:20                   ` Jens Axboe
  1 sibling, 1 reply; 30+ messages in thread
From: Pavel Begunkov @ 2024-11-21 16:18 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring

On 11/21/24 15:22, Jens Axboe wrote:
> On 11/21/24 8:15 AM, Jens Axboe wrote:
>> I'd rather entertain NOT using llists for this in the first place, as it
>> gets rid of the reversing which is the main cost here. That won't change
>> the need for a retry list necessarily, as I think we'd be better off
>> with a lockless retry list still. But at least it'd get rid of the
>> reversing. Let me see if I can dig out that patch... Totally orthogonal
>> to this topic, obviously.
> 
> It's here:
> 
> https://lore.kernel.org/io-uring/[email protected]/
> 
> I did improve it further but never posted it again, fwiw.

io_req_local_work_add() needs a smp_mb() after unlock, see comments,
release/unlock doesn't do it.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 16:18                 ` Pavel Begunkov
@ 2024-11-21 16:20                   ` Jens Axboe
  2024-11-21 16:43                     ` Pavel Begunkov
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2024-11-21 16:20 UTC (permalink / raw)
  To: Pavel Begunkov, David Wei, io-uring

On 11/21/24 9:18 AM, Pavel Begunkov wrote:
> On 11/21/24 15:22, Jens Axboe wrote:
>> On 11/21/24 8:15 AM, Jens Axboe wrote:
>>> I'd rather entertain NOT using llists for this in the first place, as it
>>> gets rid of the reversing which is the main cost here. That won't change
>>> the need for a retry list necessarily, as I think we'd be better off
>>> with a lockless retry list still. But at least it'd get rid of the
>>> reversing. Let me see if I can dig out that patch... Totally orthogonal
>>> to this topic, obviously.
>>
>> It's here:
>>
>> https://lore.kernel.org/io-uring/[email protected]/
>>
>> I did improve it further but never posted it again, fwiw.
> 
> io_req_local_work_add() needs a smp_mb() after unlock, see comments,
> release/unlock doesn't do it.

Yep, current version I have adds a smp_mb__after_unlock_lock() for that.

Will do some quick testing, but then also try the double cmpxchg on top
of that if supported.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 16:20                   ` Jens Axboe
@ 2024-11-21 16:43                     ` Pavel Begunkov
  2024-11-21 16:57                       ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Pavel Begunkov @ 2024-11-21 16:43 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring

On 11/21/24 16:20, Jens Axboe wrote:
> On 11/21/24 9:18 AM, Pavel Begunkov wrote:
>> On 11/21/24 15:22, Jens Axboe wrote:
>>> On 11/21/24 8:15 AM, Jens Axboe wrote:
>>>> I'd rather entertain NOT using llists for this in the first place, as it
>>>> gets rid of the reversing which is the main cost here. That won't change
>>>> the need for a retry list necessarily, as I think we'd be better off
>>>> with a lockless retry list still. But at least it'd get rid of the
>>>> reversing. Let me see if I can dig out that patch... Totally orthogonal
>>>> to this topic, obviously.
>>>
>>> It's here:
>>>
>>> https://lore.kernel.org/io-uring/[email protected]/
>>>
>>> I did improve it further but never posted it again, fwiw.
>>
>> io_req_local_work_add() needs a smp_mb() after unlock, see comments,
>> release/unlock doesn't do it.
> 
> Yep, current version I have adds a smp_mb__after_unlock_lock() for that.

I don't think it'd be correct. unlock_lock AFAIK is specifically
for unlock + lock, you have lock + unlock. And data you want to
synchronise is modified after the lock part. That'd need upgrading
the release semantics implied by the unlock to a full barrier.

I doubt there is a good way to optimise it. I doubt it'd give you
anything even if you replace store_release in spin_unlock with xchg()
and ignore the return, but you can probably ask Paul.


> Will do some quick testing, but then also try the double cmpxchg on top
> of that if supported.
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 16:43                     ` Pavel Begunkov
@ 2024-11-21 16:57                       ` Jens Axboe
  2024-11-21 17:05                         ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2024-11-21 16:57 UTC (permalink / raw)
  To: Pavel Begunkov, David Wei, io-uring

On 11/21/24 9:43 AM, Pavel Begunkov wrote:
> On 11/21/24 16:20, Jens Axboe wrote:
>> On 11/21/24 9:18 AM, Pavel Begunkov wrote:
>>> On 11/21/24 15:22, Jens Axboe wrote:
>>>> On 11/21/24 8:15 AM, Jens Axboe wrote:
>>>>> I'd rather entertain NOT using llists for this in the first place, as it
>>>>> gets rid of the reversing which is the main cost here. That won't change
>>>>> the need for a retry list necessarily, as I think we'd be better off
>>>>> with a lockless retry list still. But at least it'd get rid of the
>>>>> reversing. Let me see if I can dig out that patch... Totally orthogonal
>>>>> to this topic, obviously.
>>>>
>>>> It's here:
>>>>
>>>> https://lore.kernel.org/io-uring/[email protected]/
>>>>
>>>> I did improve it further but never posted it again, fwiw.
>>>
>>> io_req_local_work_add() needs a smp_mb() after unlock, see comments,
>>> release/unlock doesn't do it.
>>
>> Yep, current version I have adds a smp_mb__after_unlock_lock() for that.
> 
> I don't think that'd be correct. smp_mb__after_unlock_lock() AFAIK is
> specifically for unlock + lock, while you have lock + unlock. And the
> data you want to synchronise is modified after the lock part. That'd
> need upgrading the release semantics implied by the unlock to a full
> barrier.
> 
> I doubt there is a good way to optimise it. I doubt it'd give you
> anything even if you replaced the store_release in spin_unlock with an
> xchg() and ignored the return value, but you can probably ask Paul.

True, will just make it an smp_mb(), should be fine.

>> Will do some quick testing, but then also try the double cmpxchg on top
>> of that if supported.

This is a bit trickier, as we're not just updating the list first/last
entries... Not sure I see a good way to do this. Maybe I'm missing
something.
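
To spell it out (a sketch only; the struct is the one from
include/linux/io_uring_types.h, the helper name is made up): a
double-width cmpxchg could swap the {first, last} pair atomically, but
a FIFO append also has to write the old tail node's next pointer, which
is a third location, and the patch additionally keeps ctx->work_items
in sync.

	struct io_wq_work_node {
		struct io_wq_work_node *next;
	};

	struct io_wq_work_list {
		struct io_wq_work_node *first;
		struct io_wq_work_node *last;
	};

	/* tail append, roughly what wq_list_add_tail() does */
	static void sketch_add_tail(struct io_wq_work_node *node,
				    struct io_wq_work_list *list)
	{
		node->next = NULL;
		if (!list->first)
			list->first = node;	/* one half of the pair */
		else
			list->last->next = node; /* third location, outside any
						    cmpxchg on {first, last} */
		list->last = node;		/* other half of the pair */
	}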

I did run a basic IRQ storage test as-is, and will compare that with the
llist stuff we have now. Just in terms of overhead. It's not quite a
networking test, but you do get the IRQ side and some burstiness in
terms of completions that way too, at high rates. So should be roughly
comparable.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 16:57                       ` Jens Axboe
@ 2024-11-21 17:05                         ` Jens Axboe
  2024-11-22 17:01                           ` Pavel Begunkov
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2024-11-21 17:05 UTC (permalink / raw)
  To: Pavel Begunkov, David Wei, io-uring

[-- Attachment #1: Type: text/plain, Size: 701 bytes --]

On 11/21/24 9:57 AM, Jens Axboe wrote:
> I did run a basic IRQ storage test as-is, and will compare that with the
> llist stuff we have now. Just in terms of overhead. It's not quite a
> networking test, but you do get the IRQ side and some burstiness in
> terms of completions that way too, at high rates. So should be roughly
> comparable.

Perf looks comparable, it's about 60M IOPS. Some fluctuation with IRQ
driven, so won't render an opinion on whether one is faster than the
other. What is visible though is that adding and running local task_work
drops from 2.39% to 2.02% using spinlock + io_wq_work_list over llist,
and we entirely drop 2.2% of list reversing in the process.

-- 
Jens Axboe

[-- Attachment #2: 0002-io_uring-replace-defer-task_work-llist-with-io_wq_wo.patch --]
[-- Type: text/x-patch, Size: 11165 bytes --]

From 3690e69ed6c7a4115acc56735dce4434b1105cc3 Mon Sep 17 00:00:00 2001
From: Jens Axboe <[email protected]>
Date: Thu, 21 Nov 2024 09:18:19 -0700
Subject: [PATCH 2/2] io_uring: replace defer task_work llist with
 io_wq_work_list

Add a spinlock for the list, and replace the lockless llist with the
work list instead. This avoids needing to reverse items in the list
before running them, as the io_wq_work_list is FIFO by nature whereas
the llist is LIFO.

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/io_uring_types.h |  12 +-
 io_uring/io_uring.c            | 196 ++++++++++++++++-----------------
 io_uring/io_uring.h            |   2 +-
 3 files changed, 101 insertions(+), 109 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 011860ade268..03a102ca8105 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -335,8 +335,9 @@ struct io_ring_ctx {
 	 * regularly bounce b/w CPUs.
 	 */
 	struct {
-		struct llist_head	work_llist;
-		struct llist_head	retry_llist;
+		struct io_wq_work_list	work_list;
+		spinlock_t		work_lock;
+		int			work_items;
 		unsigned long		check_cq;
 		atomic_t		cq_wait_nr;
 		atomic_t		cq_timeouts;
@@ -566,7 +567,10 @@ enum {
 typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
 
 struct io_task_work {
-	struct llist_node		node;
+	union {
+		struct io_wq_work_node	work_node;
+		struct llist_node	node;
+	};
 	io_req_tw_func_t		func;
 };
 
@@ -622,8 +626,6 @@ struct io_kiocb {
 	 */
 	u16				buf_index;
 
-	unsigned			nr_tw;
-
 	/* REQ_F_* flags */
 	io_req_flags_t			flags;
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index c3a7d0197636..07e63e532797 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -338,9 +338,10 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_LIST_HEAD(&ctx->io_buffers_comp);
 	INIT_LIST_HEAD(&ctx->defer_list);
 	INIT_LIST_HEAD(&ctx->timeout_list);
-	INIT_LIST_HEAD(&ctx->ltimeout_list);
-	init_llist_head(&ctx->work_llist);
 	INIT_LIST_HEAD(&ctx->tctx_list);
+	INIT_LIST_HEAD(&ctx->ltimeout_list);
+	INIT_WQ_LIST(&ctx->work_list);
+	spin_lock_init(&ctx->work_lock);
 	ctx->submit_state.free_list.next = NULL;
 	INIT_HLIST_HEAD(&ctx->waitid_list);
 #ifdef CONFIG_FUTEX
@@ -1066,25 +1067,32 @@ struct llist_node *io_handle_tw_list(struct llist_node *node,
 	return node;
 }
 
-static __cold void __io_fallback_tw(struct llist_node *node, bool sync)
+static __cold void __io_fallback_tw(struct io_kiocb *req, bool sync,
+				    struct io_ring_ctx **last_ctx)
+{
+	if (sync && *last_ctx != req->ctx) {
+		if (*last_ctx) {
+			flush_delayed_work(&(*last_ctx)->fallback_work);
+			percpu_ref_put(&(*last_ctx)->refs);
+		}
+		*last_ctx = req->ctx;
+		percpu_ref_get(&(*last_ctx)->refs);
+	}
+	if (llist_add(&req->io_task_work.node, &req->ctx->fallback_llist))
+		schedule_delayed_work(&req->ctx->fallback_work, 1);
+
+}
+
+static void io_fallback_tw(struct io_uring_task *tctx, bool sync)
 {
+	struct llist_node *node = llist_del_all(&tctx->task_list);
 	struct io_ring_ctx *last_ctx = NULL;
 	struct io_kiocb *req;
 
 	while (node) {
 		req = container_of(node, struct io_kiocb, io_task_work.node);
 		node = node->next;
-		if (sync && last_ctx != req->ctx) {
-			if (last_ctx) {
-				flush_delayed_work(&last_ctx->fallback_work);
-				percpu_ref_put(&last_ctx->refs);
-			}
-			last_ctx = req->ctx;
-			percpu_ref_get(&last_ctx->refs);
-		}
-		if (llist_add(&req->io_task_work.node,
-			      &req->ctx->fallback_llist))
-			schedule_delayed_work(&req->ctx->fallback_work, 1);
+		__io_fallback_tw(req, sync, &last_ctx);
 	}
 
 	if (last_ctx) {
@@ -1093,13 +1101,6 @@ static __cold void __io_fallback_tw(struct llist_node *node, bool sync)
 	}
 }
 
-static void io_fallback_tw(struct io_uring_task *tctx, bool sync)
-{
-	struct llist_node *node = llist_del_all(&tctx->task_list);
-
-	__io_fallback_tw(node, sync);
-}
-
 struct llist_node *tctx_task_work_run(struct io_uring_task *tctx,
 				      unsigned int max_entries,
 				      unsigned int *count)
@@ -1139,73 +1140,45 @@ void tctx_task_work(struct callback_head *cb)
 
 static inline void io_req_local_work_add(struct io_kiocb *req,
 					 struct io_ring_ctx *ctx,
-					 unsigned flags)
+					 unsigned tw_flags)
 {
-	unsigned nr_wait, nr_tw, nr_tw_prev;
-	struct llist_node *head;
+	unsigned long flags;
+	unsigned nr_tw;
 
 	/* See comment above IO_CQ_WAKE_INIT */
 	BUILD_BUG_ON(IO_CQ_WAKE_FORCE <= IORING_MAX_CQ_ENTRIES);
 
 	/*
-	 * We don't know how many reuqests is there in the link and whether
-	 * they can even be queued lazily, fall back to non-lazy.
+	 * We don't know how many requests are in the link and whether they can
+	 * even be queued lazily, fall back to non-lazy.
 	 */
 	if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK))
-		flags &= ~IOU_F_TWQ_LAZY_WAKE;
+		tw_flags &= ~IOU_F_TWQ_LAZY_WAKE;
 
-	guard(rcu)();
-
-	head = READ_ONCE(ctx->work_llist.first);
-	do {
-		nr_tw_prev = 0;
-		if (head) {
-			struct io_kiocb *first_req = container_of(head,
-							struct io_kiocb,
-							io_task_work.node);
-			/*
-			 * Might be executed at any moment, rely on
-			 * SLAB_TYPESAFE_BY_RCU to keep it alive.
-			 */
-			nr_tw_prev = READ_ONCE(first_req->nr_tw);
-		}
-
-		/*
-		 * Theoretically, it can overflow, but that's fine as one of
-		 * previous adds should've tried to wake the task.
-		 */
-		nr_tw = nr_tw_prev + 1;
-		if (!(flags & IOU_F_TWQ_LAZY_WAKE))
-			nr_tw = IO_CQ_WAKE_FORCE;
-
-		req->nr_tw = nr_tw;
-		req->io_task_work.node.next = head;
-	} while (!try_cmpxchg(&ctx->work_llist.first, &head,
-			      &req->io_task_work.node));
+	spin_lock_irqsave(&ctx->work_lock, flags);
+	wq_list_add_tail(&req->io_task_work.work_node, &ctx->work_list);
+	nr_tw = ++ctx->work_items;
+	if (!(tw_flags & IOU_F_TWQ_LAZY_WAKE))
+		nr_tw = IO_CQ_WAKE_FORCE;
+	spin_unlock_irqrestore(&ctx->work_lock, flags);
 
 	/*
-	 * cmpxchg implies a full barrier, which pairs with the barrier
+	 * We need a barrier after unlock, which pairs with the barrier
 	 * in set_current_state() on the io_cqring_wait() side. It's used
 	 * to ensure that either we see updated ->cq_wait_nr, or waiters
 	 * going to sleep will observe the work added to the list, which
 	 * is similar to the wait/wawke task state sync.
 	 */
-
-	if (!head) {
+	smp_mb();
+	if (nr_tw == 1) {
 		if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
 			atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
 		if (ctx->has_evfd)
 			io_eventfd_signal(ctx);
 	}
 
-	nr_wait = atomic_read(&ctx->cq_wait_nr);
-	/* not enough or no one is waiting */
-	if (nr_tw < nr_wait)
-		return;
-	/* the previous add has already woken it up */
-	if (nr_tw_prev >= nr_wait)
-		return;
-	wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
+	if (nr_tw >= atomic_read(&ctx->cq_wait_nr))
+		wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
 }
 
 static void io_req_normal_work_add(struct io_kiocb *req)
@@ -1253,11 +1226,27 @@ void io_req_task_work_add_remote(struct io_kiocb *req, struct io_ring_ctx *ctx,
 
 static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
 {
-	struct llist_node *node = llist_del_all(&ctx->work_llist);
+	struct io_ring_ctx *last_ctx = NULL;
+	struct io_wq_work_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ctx->work_lock, flags);
+	node = ctx->work_list.first;
+	INIT_WQ_LIST(&ctx->work_list);
+	ctx->work_items = 0;
+	spin_unlock_irqrestore(&ctx->work_lock, flags);
+
+	while (node) {
+		struct io_kiocb *req;
 
-	__io_fallback_tw(node, false);
-	node = llist_del_all(&ctx->retry_llist);
-	__io_fallback_tw(node, false);
+		req = container_of(node, struct io_kiocb, io_task_work.work_node);
+		node = node->next;
+		__io_fallback_tw(req, false, &last_ctx);
+	}
+	if (last_ctx) {
+		flush_delayed_work(&last_ctx->fallback_work);
+		percpu_ref_put(&last_ctx->refs);
+	}
 }
 
 static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
@@ -1272,51 +1261,52 @@ static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
 	return false;
 }
 
-static int __io_run_local_work_loop(struct llist_node **node,
-				    struct io_tw_state *ts,
-				    int events)
-{
-	while (*node) {
-		struct llist_node *next = (*node)->next;
-		struct io_kiocb *req = container_of(*node, struct io_kiocb,
-						    io_task_work.node);
-		INDIRECT_CALL_2(req->io_task_work.func,
-				io_poll_task_func, io_req_rw_complete,
-				req, ts);
-		*node = next;
-		if (--events <= 0)
-			break;
-	}
-
-	return events;
-}
-
 static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts,
 			       int min_events)
 {
-	struct llist_node *node;
-	unsigned int loops = 0;
-	int ret, limit;
+	struct io_wq_work_node *node, *tail;
+	unsigned int loops = 1;
+	int ret, limit, nitems;
+	unsigned long flags;
 
 	if (WARN_ON_ONCE(ctx->submitter_task != current))
 		return -EEXIST;
 	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
 		atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
 	limit = max(IO_LOCAL_TW_DEFAULT_MAX, min_events);
+
 again:
-	ret = __io_run_local_work_loop(&ctx->retry_llist.first, ts, limit);
-	if (ctx->retry_llist.first)
+	spin_lock_irqsave(&ctx->work_lock, flags);
+	node = ctx->work_list.first;
+	tail = ctx->work_list.last;
+	nitems = ctx->work_items;
+	INIT_WQ_LIST(&ctx->work_list);
+	ctx->work_items = 0;
+	spin_unlock_irqrestore(&ctx->work_lock, flags);
+
+	while (node) {
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.work_node);
+		node = node->next;
+		INDIRECT_CALL_2(req->io_task_work.func,
+				io_poll_task_func, io_req_rw_complete,
+				req, ts);
+		if (++ret >= limit)
+			break;
+	}
+
+	if (unlikely(node)) {
+		spin_lock_irqsave(&ctx->work_lock, flags);
+		tail->next = ctx->work_list.first;
+		ctx->work_list.first = node;
+		if (!ctx->work_list.last)
+			ctx->work_list.last = tail;
+		ctx->work_items += nitems - ret;
+		spin_unlock_irqrestore(&ctx->work_lock, flags);
 		goto retry_done;
+	}
 
-	/*
-	 * llists are in reverse order, flip it back the right way before
-	 * running the pending items.
-	 */
-	node = llist_reverse_order(llist_del_all(&ctx->work_llist));
-	ret = __io_run_local_work_loop(&node, ts, ret);
-	ctx->retry_llist.first = node;
 	loops++;
-
 	ret = limit - ret;
 	if (io_run_local_work_continue(ctx, ret, min_events))
 		goto again;
@@ -2413,7 +2403,7 @@ static enum hrtimer_restart io_cqring_min_timer_wakeup(struct hrtimer *timer)
 	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
 		atomic_set(&ctx->cq_wait_nr, 1);
 		smp_mb();
-		if (!llist_empty(&ctx->work_llist))
+		if (io_local_work_pending(ctx))
 			goto out_wake;
 	}
 
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 214f9f175102..2fae27803116 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -349,7 +349,7 @@ static inline int io_run_task_work(void)
 
 static inline bool io_local_work_pending(struct io_ring_ctx *ctx)
 {
-	return !llist_empty(&ctx->work_llist) || !llist_empty(&ctx->retry_llist);
+	return READ_ONCE(ctx->work_list.first);
 }
 
 static inline bool io_task_work_pending(struct io_ring_ctx *ctx)
-- 
2.45.2


[-- Attachment #3: 0001-io_uring-make-task_work-pending-check-dependent-on-r.patch --]
[-- Type: text/x-patch, Size: 1141 bytes --]

From efb35f94fd1d66690fe31f0ee578479e0787bbf8 Mon Sep 17 00:00:00 2001
From: Jens Axboe <[email protected]>
Date: Thu, 21 Nov 2024 08:27:10 -0700
Subject: [PATCH 1/2] io_uring: make task_work pending check dependent on ring
 type

There's no need to check for generic task_work for DEFER_TASKRUN, if we
have local task_work pending. This avoids dipping into the huge
task_struct, if we have normal task_work pending.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/io_uring.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 12abee607e4a..214f9f175102 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -354,7 +354,9 @@ static inline bool io_local_work_pending(struct io_ring_ctx *ctx)
 
 static inline bool io_task_work_pending(struct io_ring_ctx *ctx)
 {
-	return task_work_pending(current) || io_local_work_pending(ctx);
+	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN && io_local_work_pending(ctx))
+		return true;
+	return task_work_pending(current);
 }
 
 static inline void io_tw_lock(struct io_ring_ctx *ctx, struct io_tw_state *ts)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 15:07           ` Pavel Begunkov
  2024-11-21 15:15             ` Jens Axboe
@ 2024-11-21 17:53             ` David Wei
  2024-11-22 15:57               ` Pavel Begunkov
  1 sibling, 1 reply; 30+ messages in thread
From: David Wei @ 2024-11-21 17:53 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe, io-uring

On 2024-11-21 07:07, Pavel Begunkov wrote:
> On 11/21/24 14:31, Jens Axboe wrote:
>> On 11/21/24 7:25 AM, Pavel Begunkov wrote:
>>> On 11/21/24 01:12, Jens Axboe wrote:
>>>> On 11/20/24 4:56 PM, Pavel Begunkov wrote:
>>>>> On 11/20/24 22:14, David Wei wrote:
> ...
>>>> I think that can only work if we change work_llist to be a regular list
>>>> with regular locking. Otherwise it's a bit of a mess with the list being
>>>
>>> Dylan once measured the overhead of locks vs atomics in this
>>> path for some artificial case, we can pull the numbers up.
>>
>> I did it more recently if you'll remember, actually posted a patch I
>> think a few months ago changing it to that. But even that approach adds
> 
> Right, and it'd be a separate topic from this set.
> 
>> extra overhead, if you want to add it to the same list as now you need
> 
> Extra overhead to the retry path, which is not the hot path,
> and the coldness of it is uncertain.
> 
>> to re-grab (and re-disable interrupts) the lock to add it back. My gut
>> says that would be _worse_ than the current approach. And if you keep a
>> separate list instead, well then you're back to identical overhead in
>> terms of now needing to check both when needing to know if anything is
>> pending, and checking both when running it.
>>
>>>> reordered, and then you're spending extra cycles on potentially
>>>> reordering all the entries again.
>>>
>>> That sucks, I agree, but then it's the same question of how often
>>> it happens.
>>
>> At least for now, there's a real issue reported and we should fix it. I
>> think the current patches are fine in that regard. That doesn't mean we
>> can't potentially make it better, we should certainly investigate that.
>> But I don't see the current patches as being suboptimal really, they are
>> definitely good enough as-is for solving the issue.
> 
> That's fair enough, but I still would love to know how frequent
> it is. There is no purpose in optimising it as a hot/slow path split
> if it triggers every fifth run or so. David, how easy is it to
> get some stats? We can hack up some bpftrace script
> 

Here is a sample distribution of how many task work items are done per
__io_run_local_work():

@work_done:
[1]             15385954  |@                                                   |
[2, 4)          33424809  |@@@@                                                |
[4, 8)          196055270 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
[8, 16)         419060191 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)        48395043  |@@@@@@                                              |
[32, 64)         1573469  |                                                    |
[64, 128)          98151  |                                                    |
[128, 256)         14288  |                                                    |
[256, 512)          2035  |                                                    |
[512, 1K)            268  |                                                    |
[1K, 2K)              13  |                                                    |

This workload had wait_nr set to 20 and the timeout set to 500 µs.
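
For context, the submit-and-wait side of that workload looks roughly
like this (liburing; a sketch with illustrative values, not the actual
application code):

	#include <liburing.h>

	static int submit_and_wait(struct io_uring *ring)
	{
		struct io_uring_cqe *cqe;
		/* 500 µs timeout */
		struct __kernel_timespec ts = { .tv_nsec = 500 * 1000 };

		/* wait for at least 20 completions, or until the timeout */
		return io_uring_submit_and_wait_timeout(ring, &cqe, 20, &ts, NULL);
	}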

Empirically, I know that any run doing > 50 task work items will violate
the latency limit for this workload. In these cases, all the requests
must be dropped. So even if excessive task work happens only a small %
of the time, the impact is far larger than that share suggests.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 17:53             ` David Wei
@ 2024-11-22 15:57               ` Pavel Begunkov
  0 siblings, 0 replies; 30+ messages in thread
From: Pavel Begunkov @ 2024-11-22 15:57 UTC (permalink / raw)
  To: David Wei, Jens Axboe, io-uring

On 11/21/24 17:53, David Wei wrote:
> On 2024-11-21 07:07, Pavel Begunkov wrote:
>> On 11/21/24 14:31, Jens Axboe wrote:
>>> On 11/21/24 7:25 AM, Pavel Begunkov wrote:
>>>> On 11/21/24 01:12, Jens Axboe wrote:
>>>>> On 11/20/24 4:56 PM, Pavel Begunkov wrote:
>>>>>> On 11/20/24 22:14, David Wei wrote:
>> ...
>>>>> I think that can only work if we change work_llist to be a regular list
>>>>> with regular locking. Otherwise it's a bit of a mess with the list being
>>>>
>>>> Dylan once measured the overhead of locks vs atomics in this
>>>> path for some artificial case, we can pull the numbers up.
>>>
>>> I did it more recently if you'll remember, actually posted a patch I
>>> think a few months ago changing it to that. But even that approach adds
>>
>> Right, and it'd be a separate topic from this set.
>>
>>> extra overhead, if you want to add it to the same list as now you need
>>
>> Extra overhead to the retry path, which is not the hot path,
>> and the coldness of it is uncertain.
>>
>>> to re-grab (and re-disable interrupts) the lock to add it back. My gut
>>> says that would be _worse_ than the current approach. And if you keep a
>>> separate list instead, well then you're back to identical overhead in
>>> terms of now needing to check both when needing to know if anything is
>>> pending, and checking both when running it.
>>>
>>>>> reordered, and then you're spending extra cycles on potentially
>>>>> reordering all the entries again.
>>>>
>>>> That sucks, I agree, but then it's the same question of how often
>>>> it happens.
>>>
>>> At least for now, there's a real issue reported and we should fix it. I
>>> think the current patches are fine in that regard. That doesn't mean we
>>> can't potentially make it better, we should certainly investigate that.
>>> But I don't see the current patches as being suboptimal really, they are
>>> definitely good enough as-is for solving the issue.
>>
>> That's fair enough, but I still would love to know how frequent
>> it is. There is no purpose in optimising it as a hot/slow path split
>> if it triggers every fifth run or so. David, how easy is it to
>> get some stats? We can hack up some bpftrace script
>>
> 
> Here is a sample distribution of how many task work items are done per
> __io_run_local_work():
> 
> @work_done:
> [1]             15385954  |@                                                   |
> [2, 4)          33424809  |@@@@                                                |
> [4, 8)          196055270 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
> [8, 16)         419060191 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16, 32)        48395043  |@@@@@@                                              |
> [32, 64)         1573469  |                                                    |
> [64, 128)          98151  |                                                    |
> [128, 256)         14288  |                                                    |
> [256, 512)          2035  |                                                    |
> [512, 1K)            268  |                                                    |
> [1K, 2K)              13  |                                                    |

Nice

> This workload had wait_nr set to 20 and the timeout set to 500 µs.
> 
> Empirically, I know that any run doing > 50 task work items will violate
> the latency limit for this workload. In these cases, all the requests
> must be dropped. So even if excessive task work happens only a small %
> of the time, the impact is far larger than that share suggests.

So you've got a long tail, which spikes your nines; that makes sense.
On the other hand it's perhaps 5-10% of the total, though that's hard
to judge as the [16,32) bucket is split by the constant 20. My guess
would be that a small optimisation for the normal case, even if it adds
a bit more to the requeue path, may well be worth it, but that depends
on how sharp the skew inside that bucket is.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 17:05                         ` Jens Axboe
@ 2024-11-22 17:01                           ` Pavel Begunkov
  2024-11-22 17:08                             ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Pavel Begunkov @ 2024-11-22 17:01 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring

On 11/21/24 17:05, Jens Axboe wrote:
> On 11/21/24 9:57 AM, Jens Axboe wrote:
>> I did run a basic IRQ storage test as-is, and will compare that with the
>> llist stuff we have now. Just in terms of overhead. It's not quite a
>> networking test, but you do get the IRQ side and some burstiness in
>> terms of completions that way too, at high rates. So should be roughly
>> comparable.
> 
> Perf looks comparable, it's about 60M IOPS. Some fluctuation with IRQ

60M with iopoll? That one normally shouldn't use task_work

> driven, so won't render an opinion on whether one is faster than the
> other. What is visible though is that adding and running local task_work
> drops from 2.39% to 2.02% using spinlock + io_wq_work_list over llist,

Did you sum it up with io_req_local_work_add()? That just sounds a bit
weird since it's usually run off [soft]irq, and I have doubts that part
became faster. Running could be, especially with a high QD and the
consistency of an SSD. Btw, what QD was it? 32?

> and we entirely drop 2.2% of list reversing in the process.

We actually discussed it before, in some different patchset: perf is
not much help here, the overhead and cache loading move around a lot
between functions.

I don't think we have a solid proof here, especially for networking
workloads, which tend to hammer it more from more CPUs. Can we run
some net benchmarks? Even better to do a good prod experiment.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-22 17:01                           ` Pavel Begunkov
@ 2024-11-22 17:08                             ` Jens Axboe
  2024-11-23  0:50                               ` Pavel Begunkov
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2024-11-22 17:08 UTC (permalink / raw)
  To: Pavel Begunkov, David Wei, io-uring

On 11/22/24 10:01 AM, Pavel Begunkov wrote:
> On 11/21/24 17:05, Jens Axboe wrote:
>> On 11/21/24 9:57 AM, Jens Axboe wrote:
>>> I did run a basic IRQ storage test as-is, and will compare that with the
>>> llist stuff we have now. Just in terms of overhead. It's not quite a
>>> networking test, but you do get the IRQ side and some burstiness in
>>> terms of completions that way too, at high rates. So should be roughly
>>> comparable.
>>
>> Perf looks comparable, it's about 60M IOPS. Some fluctuation with IRQ
> 
>> 60M with iopoll? That one normally shouldn't use task_work

Maybe that wasn't clear, but it's IRQ driven IO. Otherwise indeed
there'd be no task_work in use.

>> driven, so won't render an opinion on whether one is faster than the
>> other. What is visible though is that adding and running local task_work
>> drops from 2.39% to 2.02% using spinlock + io_wq_work_list over llist,
> 
> Did you sum it up with io_req_local_work_add()? That just sounds a bit
> weird since it's usually run off [soft]irq, and I have doubts that part
> became faster. Running could be, especially with a high QD and the
> consistency of an SSD. Btw, what QD was it? 32?

It may just trigger more frequently in the profile since the list
reversal is gone. Profiling isn't 100% exact.

>> and we entirely drop 2.2% of list reversing in the process.
> 
>> We actually discussed it before, in some different patchset: perf is
>> not much help here, the overhead and cache loading move around a lot
>> between functions.
> 
> I don't think we have a solid proof here, especially for networking
> workloads, which tend to hammer it more from more CPUs. Can we run
> some net benchmarks? Even better to do a good prod experiment.

Already in motion. I ran some here and they didn't show any differences
at all, but the task_work load was also fairly light. David is running
the networking side and we'll see what it says.

I don't particularly love list + lock for this, but at the end of the
day, the only real downside is the irq disabling nature of it.
Everything else is both simpler, and avoids the really annoying LIFO
nature of llist. I'd expect, all things being equal, that list + lock is
going to be ever so slightly slower. Both will bounce the list
cacheline, no difference in cost on that side. But when you add list
reversal to the mix, that's going to push it to being an overall win.
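
For reference, the two add-side shapes being compared, trimmed from the
current io_req_local_work_add() and from the patch upthread (a sketch,
not the complete functions):

	/* llist: lock-free add, but the consumer gets the items newest-first */
	head = READ_ONCE(ctx->work_llist.first);
	do {
		req->io_task_work.node.next = head;
	} while (!try_cmpxchg(&ctx->work_llist.first, &head,
			      &req->io_task_work.node));
	/* ...so the run side pays llist_reverse_order() to restore FIFO */

	/* spinlock + io_wq_work_list: FIFO by construction, but irq disabling */
	spin_lock_irqsave(&ctx->work_lock, flags);
	wq_list_add_tail(&req->io_task_work.work_node, &ctx->work_list);
	spin_unlock_irqrestore(&ctx->work_lock, flags);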

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-22 17:08                             ` Jens Axboe
@ 2024-11-23  0:50                               ` Pavel Begunkov
  0 siblings, 0 replies; 30+ messages in thread
From: Pavel Begunkov @ 2024-11-23  0:50 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring

On 11/22/24 17:08, Jens Axboe wrote:
> On 11/22/24 10:01 AM, Pavel Begunkov wrote:
>> On 11/21/24 17:05, Jens Axboe wrote:
>>> On 11/21/24 9:57 AM, Jens Axboe wrote:
>>>> I did run a basic IRQ storage test as-is, and will compare that with the
>>>> llist stuff we have now. Just in terms of overhead. It's not quite a
>>>> networking test, but you do get the IRQ side and some burstiness in
>>>> terms of completions that way too, at high rates. So should be roughly
>>>> comparable.
>>>
>>> Perf looks comparable, it's about 60M IOPS. Some fluctuation with IRQ
>>
>> 60M with iopoll? That one normally shouldn't use task_work
> 
> Maybe that wasn't clear, but it's IRQ driven IO. Otherwise indeed
> there'd be no task_work in use.
> 
>>> driven, so won't render an opinion on whether one is faster than the
>>> other. What is visible though is that adding and running local task_work
>>> drops from 2.39% to 2.02% using spinlock + io_wq_work_list over llist,
>>
>> Did you sum it up with io_req_local_work_add()? That just sounds a bit
>> weird since it's usually run off [soft]irq, and I have doubts that part
>> became faster. Running could be, especially with a high QD and the
>> consistency of an SSD. Btw, what QD was it? 32?

The reason I asked about QD is that storage tests reliably give
you a list of QD task work items, and the longer the list, the more
expensive the reversal is, washing out cache lines.

For QD=32 it's a 32-entry list reversal, so I'd get it if you're
seeing some perf improvement. With QD=1 it would be the opposite.
David's case is similar: he gets a long list because of wait-based
batching. Users who don't do that might get worse performance (which
might be fine).

> It may just trigger more frequently in the profile since the list
> reversal is gone. Profiling isn't 100% exact.
> 
>>> and we entirely drop 2.2% of list reversing in the process.
>>
>> We actually discussed it before, in some different patchset: perf is
>> not much help here, the overhead and cache loading move around a lot
>> between functions.
>>
>> I don't think we have a solid proof here, especially for networking
>> workloads, which tend to hammer it more from more CPUs. Can we run
>> some net benchmarks? Even better to do a good prod experiment.
> 
> Already in motion. I ran some here and they didn't show any differences
> at all, but the task_work load was also fairly light. David is running
> the networking side and we'll see what it says.

That's great; if it survives high-traffic prod there should be
less need to worry about it regressing.

The eerie part is that we keep switching it back and forth,
rediscovering the same problems. Even the reordering issue was
mentioned and warned about before the wait-free list got merged, but
it was successfully ignored until we got latency issues. And now we've
come full circle. Would be nice to find some peace (or something
inarguably better).


> I don't particularly love list + lock for this, but at the end of the
> day, the only real downside is the irq disabling nature of it.
> Everything else is both simpler, and avoids the really annoying LIFO
> nature of llist. I'd expect, all things being equal, that list + lock is
> going to be ever so slightly slower. Both will bounce the list
> cacheline, no difference in cost on that side. But when you add list
> reversal to the mix, that's going to push it to being an overall win.
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2024-11-23  0:49 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-20 22:14 [PATCH next v1 0/2] limit local tw done David Wei
2024-11-20 22:14 ` [PATCH next v1 1/2] io_uring: add io_local_work_pending() David Wei
2024-11-20 23:45   ` Pavel Begunkov
2024-11-20 22:14 ` [PATCH next v1 2/2] io_uring: limit local tw done David Wei
2024-11-20 23:56   ` Pavel Begunkov
2024-11-21  0:52     ` David Wei
2024-11-21 14:29       ` Pavel Begunkov
2024-11-21 14:34         ` Jens Axboe
2024-11-21 14:58           ` Pavel Begunkov
2024-11-21 15:02             ` Jens Axboe
2024-11-21  1:12     ` Jens Axboe
2024-11-21 14:25       ` Pavel Begunkov
2024-11-21 14:31         ` Jens Axboe
2024-11-21 15:07           ` Pavel Begunkov
2024-11-21 15:15             ` Jens Axboe
2024-11-21 15:22               ` Jens Axboe
2024-11-21 16:00                 ` Pavel Begunkov
2024-11-21 16:05                   ` Jens Axboe
2024-11-21 16:18                 ` Pavel Begunkov
2024-11-21 16:20                   ` Jens Axboe
2024-11-21 16:43                     ` Pavel Begunkov
2024-11-21 16:57                       ` Jens Axboe
2024-11-21 17:05                         ` Jens Axboe
2024-11-22 17:01                           ` Pavel Begunkov
2024-11-22 17:08                             ` Jens Axboe
2024-11-23  0:50                               ` Pavel Begunkov
2024-11-21 17:53             ` David Wei
2024-11-22 15:57               ` Pavel Begunkov
2024-11-21  1:12 ` [PATCH next v1 0/2] " Jens Axboe
2024-11-21 14:16 ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox