public inbox for [email protected]
* [PATCH next v1 0/2] limit local tw done
@ 2024-11-20 22:14 David Wei
  2024-11-20 22:14 ` [PATCH next v1 1/2] io_uring: add io_local_work_pending() David Wei
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: David Wei @ 2024-11-20 22:14 UTC (permalink / raw)
  To: io-uring; +Cc: David Wei, Jens Axboe, Pavel Begunkov

Currently, running local task work (tw) eagerly tries to complete all
available work. When there is a burst of work coming in, the list of
work in work_llist may be much larger than the requested batch count
wait_nr. Doing all of the work eagerly may cause latency issues for
applications that do not want to spend so much time in the kernel.

Limit the amount of local tw done to the max of 20 (a heuristic) or
wait_nr. This does not require any userspace changes.

Many applications will submit and wait with wait_nr = 1 to mean "wait
for _at least_ 1 completion, but do more work if available". This used
to mean "all work", but it is now capped at 20 requests. If users want
more work batching, they can set wait_nr > 20.
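
As an example of the userspace side (illustrative only, using the
existing liburing API on an already-initialised ring; not part of this
series):

	struct io_uring_cqe *cqe;

	/*
	 * Wait for at least one completion. The kernel may still run up
	 * to max(20, wait_nr) local tw items in this call; passing a
	 * larger wait_nr raises that batch limit.
	 */
	io_uring_submit_and_wait(&ring, 1);
	io_uring_peek_cqe(&ring, &cqe);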

David Wei (2):
  io_uring: add io_local_work_pending()
  io_uring: limit local tw done

 include/linux/io_uring_types.h |  1 +
 io_uring/io_uring.c            | 57 +++++++++++++++++++++++-----------
 io_uring/io_uring.h            |  9 ++++--
 3 files changed, 47 insertions(+), 20 deletions(-)

-- 
2.43.5



* [PATCH next v1 1/2] io_uring: add io_local_work_pending()
  2024-11-20 22:14 [PATCH next v1 0/2] limit local tw done David Wei
@ 2024-11-20 22:14 ` David Wei
  2024-11-20 23:45   ` Pavel Begunkov
  2024-11-20 22:14 ` [PATCH next v1 2/2] io_uring: limit local tw done David Wei
  2024-11-21  1:12 ` [PATCH next v1 0/2] " Jens Axboe
  2 siblings, 1 reply; 8+ messages in thread
From: David Wei @ 2024-11-20 22:14 UTC (permalink / raw)
  To: io-uring; +Cc: David Wei, Jens Axboe, Pavel Begunkov

In preparation for adding a new llist of tw to retry due to hitting the
tw limit, add a helper io_local_work_pending(). This function returns
true if there is any local tw pending. For now it only checks
ctx->work_llist.

Signed-off-by: David Wei <[email protected]>
---
 io_uring/io_uring.c | 14 +++++++-------
 io_uring/io_uring.h |  9 +++++++--
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 801293399883..83bf041d2648 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1260,7 +1260,7 @@ static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
 static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
 				       int min_events)
 {
-	if (llist_empty(&ctx->work_llist))
+	if (!io_local_work_pending(ctx))
 		return false;
 	if (events < min_events)
 		return true;
@@ -1313,7 +1313,7 @@ static inline int io_run_local_work_locked(struct io_ring_ctx *ctx,
 {
 	struct io_tw_state ts = {};
 
-	if (llist_empty(&ctx->work_llist))
+	if (!io_local_work_pending(ctx))
 		return 0;
 	return __io_run_local_work(ctx, &ts, min_events);
 }
@@ -2328,7 +2328,7 @@ static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
 
 int io_run_task_work_sig(struct io_ring_ctx *ctx)
 {
-	if (!llist_empty(&ctx->work_llist)) {
+	if (io_local_work_pending(ctx)) {
 		__set_current_state(TASK_RUNNING);
 		if (io_run_local_work(ctx, INT_MAX) > 0)
 			return 0;
@@ -2459,7 +2459,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 {
 	if (unlikely(READ_ONCE(ctx->check_cq)))
 		return 1;
-	if (unlikely(!llist_empty(&ctx->work_llist)))
+	if (unlikely(io_local_work_pending(ctx)))
 		return 1;
 	if (unlikely(task_work_pending(current)))
 		return 1;
@@ -2493,7 +2493,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
 
 	if (!io_allowed_run_tw(ctx))
 		return -EEXIST;
-	if (!llist_empty(&ctx->work_llist))
+	if (io_local_work_pending(ctx))
 		io_run_local_work(ctx, min_events);
 	io_run_task_work();
 
@@ -2564,7 +2564,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
 		 * If we got woken because of task_work being processed, run it
 		 * now rather than let the caller do another wait loop.
 		 */
-		if (!llist_empty(&ctx->work_llist))
+		if (io_local_work_pending(ctx))
 			io_run_local_work(ctx, nr_wait);
 		io_run_task_work();
 
@@ -3158,7 +3158,7 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
 		io_run_task_work();
 		io_uring_drop_tctx_refs(current);
 		xa_for_each(&tctx->xa, index, node) {
-			if (!llist_empty(&node->ctx->work_llist)) {
+			if (io_local_work_pending(node->ctx)) {
 				WARN_ON_ONCE(node->ctx->submitter_task &&
 					     node->ctx->submitter_task != current);
 				goto end_wait;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 4070d4c8ef97..69eb3b23a5a0 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -347,9 +347,14 @@ static inline int io_run_task_work(void)
 	return ret;
 }
 
+static inline bool io_local_work_pending(struct io_ring_ctx *ctx)
+{
+	return !llist_empty(&ctx->work_llist);
+}
+
 static inline bool io_task_work_pending(struct io_ring_ctx *ctx)
 {
-	return task_work_pending(current) || !llist_empty(&ctx->work_llist);
+	return task_work_pending(current) || io_local_work_pending(ctx);
 }
 
 static inline void io_tw_lock(struct io_ring_ctx *ctx, struct io_tw_state *ts)
@@ -484,6 +489,6 @@ enum {
 static inline bool io_has_work(struct io_ring_ctx *ctx)
 {
 	return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq) ||
-	       !llist_empty(&ctx->work_llist);
+	       io_local_work_pending(ctx);
 }
 #endif
-- 
2.43.5



* [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-20 22:14 [PATCH next v1 0/2] limit local tw done David Wei
  2024-11-20 22:14 ` [PATCH next v1 1/2] io_uring: add io_local_work_pending() David Wei
@ 2024-11-20 22:14 ` David Wei
  2024-11-20 23:56   ` Pavel Begunkov
  2024-11-21  1:12 ` [PATCH next v1 0/2] " Jens Axboe
  2 siblings, 1 reply; 8+ messages in thread
From: David Wei @ 2024-11-20 22:14 UTC (permalink / raw)
  To: io-uring; +Cc: David Wei, Jens Axboe, Pavel Begunkov

Instead of eagerly running all available local tw, limit the amount of
local tw done to the max of IO_LOCAL_TW_DEFAULT_MAX (20) or wait_nr. The
value of 20 is chosen as a reasonable heuristic to allow enough work
batching but also keep latency down.

Add a retry_llist that maintains a list of local tw that couldn't be
done in time. No synchronisation is needed since it is only modified
within the task context.
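
To spell out the batching rule (illustrative numbers only):

	limit = max(IO_LOCAL_TW_DEFAULT_MAX, min_events);
	/*
	 * e.g. wait_nr = 1  -> at most 20 tw items per batch
	 *      wait_nr = 64 -> at most 64 tw items per batch
	 */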

Signed-off-by: David Wei <[email protected]>
---
 include/linux/io_uring_types.h |  1 +
 io_uring/io_uring.c            | 43 +++++++++++++++++++++++++---------
 io_uring/io_uring.h            |  2 +-
 3 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 593c10a02144..011860ade268 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -336,6 +336,7 @@ struct io_ring_ctx {
 	 */
 	struct {
 		struct llist_head	work_llist;
+		struct llist_head	retry_llist;
 		unsigned long		check_cq;
 		atomic_t		cq_wait_nr;
 		atomic_t		cq_timeouts;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 83bf041d2648..c3a7d0197636 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -121,6 +121,7 @@
 
 #define IO_COMPL_BATCH			32
 #define IO_REQ_ALLOC_BATCH		8
+#define IO_LOCAL_TW_DEFAULT_MAX		20
 
 struct io_defer_entry {
 	struct list_head	list;
@@ -1255,6 +1256,8 @@ static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
 	struct llist_node *node = llist_del_all(&ctx->work_llist);
 
 	__io_fallback_tw(node, false);
+	node = llist_del_all(&ctx->retry_llist);
+	__io_fallback_tw(node, false);
 }
 
 static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
@@ -1269,37 +1272,55 @@ static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
 	return false;
 }
 
+static int __io_run_local_work_loop(struct llist_node **node,
+				    struct io_tw_state *ts,
+				    int events)
+{
+	while (*node) {
+		struct llist_node *next = (*node)->next;
+		struct io_kiocb *req = container_of(*node, struct io_kiocb,
+						    io_task_work.node);
+		INDIRECT_CALL_2(req->io_task_work.func,
+				io_poll_task_func, io_req_rw_complete,
+				req, ts);
+		*node = next;
+		if (--events <= 0)
+			break;
+	}
+
+	return events;
+}
+
 static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts,
 			       int min_events)
 {
 	struct llist_node *node;
 	unsigned int loops = 0;
-	int ret = 0;
+	int ret, limit;
 
 	if (WARN_ON_ONCE(ctx->submitter_task != current))
 		return -EEXIST;
 	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
 		atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
+	limit = max(IO_LOCAL_TW_DEFAULT_MAX, min_events);
 again:
+	ret = __io_run_local_work_loop(&ctx->retry_llist.first, ts, limit);
+	if (ctx->retry_llist.first)
+		goto retry_done;
+
 	/*
 	 * llists are in reverse order, flip it back the right way before
 	 * running the pending items.
 	 */
 	node = llist_reverse_order(llist_del_all(&ctx->work_llist));
-	while (node) {
-		struct llist_node *next = node->next;
-		struct io_kiocb *req = container_of(node, struct io_kiocb,
-						    io_task_work.node);
-		INDIRECT_CALL_2(req->io_task_work.func,
-				io_poll_task_func, io_req_rw_complete,
-				req, ts);
-		ret++;
-		node = next;
-	}
+	ret = __io_run_local_work_loop(&node, ts, ret);
+	ctx->retry_llist.first = node;
 	loops++;
 
+	ret = limit - ret;
 	if (io_run_local_work_continue(ctx, ret, min_events))
 		goto again;
+retry_done:
 	io_submit_flush_completions(ctx);
 	if (io_run_local_work_continue(ctx, ret, min_events))
 		goto again;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 69eb3b23a5a0..12abee607e4a 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -349,7 +349,7 @@ static inline int io_run_task_work(void)
 
 static inline bool io_local_work_pending(struct io_ring_ctx *ctx)
 {
-	return !llist_empty(&ctx->work_llist);
+	return !llist_empty(&ctx->work_llist) || !llist_empty(&ctx->retry_llist);
 }
 
 static inline bool io_task_work_pending(struct io_ring_ctx *ctx)
-- 
2.43.5



* Re: [PATCH next v1 1/2] io_uring: add io_local_work_pending()
  2024-11-20 22:14 ` [PATCH next v1 1/2] io_uring: add io_local_work_pending() David Wei
@ 2024-11-20 23:45   ` Pavel Begunkov
  0 siblings, 0 replies; 8+ messages in thread
From: Pavel Begunkov @ 2024-11-20 23:45 UTC (permalink / raw)
  To: David Wei, io-uring; +Cc: Jens Axboe

On 11/20/24 22:14, David Wei wrote:
> In preparation for adding a new llist of tw to retry due to hitting the
> tw limit, add a helper io_local_work_pending(). This function returns
> true if there is any local tw pending. For now it only checks
> ctx->work_llist.

Looks clean, we can even take it separately from 2/2

Reviewed-by: Pavel Begunkov <[email protected]>

-- 
Pavel Begunkov


* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-20 22:14 ` [PATCH next v1 2/2] io_uring: limit local tw done David Wei
@ 2024-11-20 23:56   ` Pavel Begunkov
  2024-11-21  0:52     ` David Wei
  2024-11-21  1:12     ` Jens Axboe
  0 siblings, 2 replies; 8+ messages in thread
From: Pavel Begunkov @ 2024-11-20 23:56 UTC (permalink / raw)
  To: David Wei, io-uring; +Cc: Jens Axboe

On 11/20/24 22:14, David Wei wrote:
> Instead of eagerly running all available local tw, limit the amount of
> local tw done to the max of IO_LOCAL_TW_DEFAULT_MAX (20) or wait_nr. The
> value of 20 is chosen as a reasonable heuristic to allow enough work
> batching but also keep latency down.
> 
> Add a retry_llist that maintains a list of local tw that couldn't be
> done in time. No synchronisation is needed since it is only modified
> within the task context.
> 
> Signed-off-by: David Wei <[email protected]>
> ---
>   include/linux/io_uring_types.h |  1 +
>   io_uring/io_uring.c            | 43 +++++++++++++++++++++++++---------
>   io_uring/io_uring.h            |  2 +-
>   3 files changed, 34 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 593c10a02144..011860ade268 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -336,6 +336,7 @@ struct io_ring_ctx {
>   	 */
>   	struct {
>   		struct llist_head	work_llist;
> +		struct llist_head	retry_llist;

Fwiw, probably doesn't matter, but it doesn't even need
to be atomic: it's queued and spliced while holding
->uring_lock, and the pending check is also synchronised,
as there is only one possible task doing that.

>   		unsigned long		check_cq;
>   		atomic_t		cq_wait_nr;
>   		atomic_t		cq_timeouts;
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 83bf041d2648..c3a7d0197636 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -121,6 +121,7 @@
...
>   static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts,
>   			       int min_events)
>   {
>   	struct llist_node *node;
>   	unsigned int loops = 0;
> -	int ret = 0;
> +	int ret, limit;
>   
>   	if (WARN_ON_ONCE(ctx->submitter_task != current))
>   		return -EEXIST;
>   	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
>   		atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
> +	limit = max(IO_LOCAL_TW_DEFAULT_MAX, min_events);
>   again:
> +	ret = __io_run_local_work_loop(&ctx->retry_llist.first, ts, limit);
> +	if (ctx->retry_llist.first)
> +		goto retry_done;
> +
>   	/*
>   	 * llists are in reverse order, flip it back the right way before
>   	 * running the pending items.
>   	 */
>   	node = llist_reverse_order(llist_del_all(&ctx->work_llist));
> -	while (node) {
> -		struct llist_node *next = node->next;
> -		struct io_kiocb *req = container_of(node, struct io_kiocb,
> -						    io_task_work.node);
> -		INDIRECT_CALL_2(req->io_task_work.func,
> -				io_poll_task_func, io_req_rw_complete,
> -				req, ts);
> -		ret++;
> -		node = next;
> -	}
> +	ret = __io_run_local_work_loop(&node, ts, ret);

One thing that is not so nice is that now we have this handling and
checks in the hot path, and __io_run_local_work_loop() most likely
gets uninlined.

I wonder, can we just requeue it via task_work again? We can even
add a variant efficiently adding a list instead of a single entry,
i.e. local_task_work_add(head, tail, ...);
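
Something like this (untested sketch, the name is made up, but
llist_add_batch() is the existing llist primitive for splicing a
chain in one atomic step):

	static void local_task_work_add_batch(struct io_ring_ctx *ctx,
					      struct llist_node *first,
					      struct llist_node *last)
	{
		/* requeue the leftover [first, last] chain in one go */
		llist_add_batch(first, last, &ctx->work_llist);
	}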

I'm also curious what's the use case you've got that is hitting
the problem?

-- 
Pavel Begunkov


* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-20 23:56   ` Pavel Begunkov
@ 2024-11-21  0:52     ` David Wei
  2024-11-21  1:12     ` Jens Axboe
  1 sibling, 0 replies; 8+ messages in thread
From: David Wei @ 2024-11-21  0:52 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Jens Axboe

On 2024-11-20 15:56, Pavel Begunkov wrote:
> On 11/20/24 22:14, David Wei wrote:
>> Instead of eagerly running all available local tw, limit the amount of
>> local tw done to the max of IO_LOCAL_TW_DEFAULT_MAX (20) or wait_nr. The
>> value of 20 is chosen as a reasonable heuristic to allow enough work
>> batching but also keep latency down.
>>
>> Add a retry_llist that maintains a list of local tw that couldn't be
>> done in time. No synchronisation is needed since it is only modified
>> within the task context.
>>
>> Signed-off-by: David Wei <[email protected]>
>> ---
>>   include/linux/io_uring_types.h |  1 +
>>   io_uring/io_uring.c            | 43 +++++++++++++++++++++++++---------
>>   io_uring/io_uring.h            |  2 +-
>>   3 files changed, 34 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index 593c10a02144..011860ade268 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -336,6 +336,7 @@ struct io_ring_ctx {
>>        */
>>       struct {
>>           struct llist_head    work_llist;
>> +        struct llist_head    retry_llist;
> 
> Fwiw, probably doesn't matter, but it doesn't even need
> to be atomic: it's queued and spliced while holding
> ->uring_lock, and the pending check is also synchronised,
> as there is only one possible task doing that.

Yeah, it doesn't. Keeping it as a llist_head means being able to re-use
helpers that take llist_head or llist_node.

> 
>>           unsigned long        check_cq;
>>           atomic_t        cq_wait_nr;
>>           atomic_t        cq_timeouts;
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 83bf041d2648..c3a7d0197636 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -121,6 +121,7 @@
> ...
>>   static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts,
>>                      int min_events)
>>   {
>>       struct llist_node *node;
>>       unsigned int loops = 0;
>> -    int ret = 0;
>> +    int ret, limit;
>>         if (WARN_ON_ONCE(ctx->submitter_task != current))
>>           return -EEXIST;
>>       if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
>>           atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
>> +    limit = max(IO_LOCAL_TW_DEFAULT_MAX, min_events);
>>   again:
>> +    ret = __io_run_local_work_loop(&ctx->retry_llist.first, ts, limit);
>> +    if (ctx->retry_llist.first)
>> +        goto retry_done;
>> +
>>       /*
>>        * llists are in reverse order, flip it back the right way before
>>        * running the pending items.
>>        */
>>       node = llist_reverse_order(llist_del_all(&ctx->work_llist));
>> -    while (node) {
>> -        struct llist_node *next = node->next;
>> -        struct io_kiocb *req = container_of(node, struct io_kiocb,
>> -                            io_task_work.node);
>> -        INDIRECT_CALL_2(req->io_task_work.func,
>> -                io_poll_task_func, io_req_rw_complete,
>> -                req, ts);
>> -        ret++;
>> -        node = next;
>> -    }
>> +    ret = __io_run_local_work_loop(&node, ts, ret);
> 
> One thing that is not so nice is that now we have this handling and
> checks in the hot path, and __io_run_local_work_loop() most likely
> gets uninlined.
> 
> I wonder, can we just requeue it via task_work again? We can even
> add a variant efficiently adding a list instead of a single entry,
> i.e. local_task_work_add(head, tail, ...);

That was an early idea, but it means re-reversing the list and then
atomically adding each node back to work_llist concurrently with e.g.
io_req_local_work_add(). (The leftovers are already oldest-first after
the llist_reverse_order() pass; llist_add() pushes at the head, so
requeueing them one by one would make the list newest-first again.)

Using a separate retry_llist means we don't need to concurrently add to
either retry_llist or work_llist.

> 
> I'm also curious what's the use case you've got that is hitting
> the problem?
> 

There is a Memcache-like workload that has load shedding based on the
time spent doing work. With epoll, the work of reading sockets and
processing a request is done in userspace, which can decide after some
amount of time to drop the remaining work if it takes too long. With
io_uring, the work of reading sockets is done eagerly inside of task
work. If there is a burst of work, then so much time is spent in task
work reading from sockets that, by the time control returns to
userspace, the timeout has already elapsed.
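
Roughly, the epoll version of the loop looks like this (illustrative
sketch only; now_ns(), handle() and BUDGET_NS are stand-ins, not code
from the real application):

	int n = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
	uint64_t start = now_ns();

	for (int i = 0; i < n; i++) {
		/* read the socket and process the request in userspace */
		handle(events[i].data.ptr);
		/* load shedding: drop the remaining work once over budget */
		if (now_ns() - start > BUDGET_NS)
			break;
	}

With io_uring the socket reads happen inside task work before control
returns, so userspace never gets a chance to apply that cut-off.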


* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-20 23:56   ` Pavel Begunkov
  2024-11-21  0:52     ` David Wei
@ 2024-11-21  1:12     ` Jens Axboe
  1 sibling, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2024-11-21  1:12 UTC (permalink / raw)
  To: Pavel Begunkov, David Wei, io-uring

On 11/20/24 4:56 PM, Pavel Begunkov wrote:
> On 11/20/24 22:14, David Wei wrote:
>> Instead of eagerly running all available local tw, limit the amount of
>> local tw done to the max of IO_LOCAL_TW_DEFAULT_MAX (20) or wait_nr. The
>> value of 20 is chosen as a reasonable heuristic to allow enough work
>> batching but also keep latency down.
>>
>> Add a retry_llist that maintains a list of local tw that couldn't be
>> done in time. No synchronisation is needed since it is only modified
>> within the task context.
>>
>> Signed-off-by: David Wei <[email protected]>
>> ---
>>   include/linux/io_uring_types.h |  1 +
>>   io_uring/io_uring.c            | 43 +++++++++++++++++++++++++---------
>>   io_uring/io_uring.h            |  2 +-
>>   3 files changed, 34 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index 593c10a02144..011860ade268 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -336,6 +336,7 @@ struct io_ring_ctx {
>>        */
>>       struct {
>>           struct llist_head    work_llist;
>> +        struct llist_head    retry_llist;
> 
> Fwiw, probably doesn't matter, but it doesn't even need
> to be atomic: it's queued and spliced while holding
> ->uring_lock, and the pending check is also synchronised,
> as there is only one possible task doing that.
> 
>>           unsigned long        check_cq;
>>           atomic_t        cq_wait_nr;
>>           atomic_t        cq_timeouts;
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 83bf041d2648..c3a7d0197636 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -121,6 +121,7 @@
> ...
>>   static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts,
>>                      int min_events)
>>   {
>>       struct llist_node *node;
>>       unsigned int loops = 0;
>> -    int ret = 0;
>> +    int ret, limit;
>>         if (WARN_ON_ONCE(ctx->submitter_task != current))
>>           return -EEXIST;
>>       if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
>>           atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
>> +    limit = max(IO_LOCAL_TW_DEFAULT_MAX, min_events);
>>   again:
>> +    ret = __io_run_local_work_loop(&ctx->retry_llist.first, ts, limit);
>> +    if (ctx->retry_llist.first)
>> +        goto retry_done;
>> +
>>       /*
>>        * llists are in reverse order, flip it back the right way before
>>        * running the pending items.
>>        */
>>       node = llist_reverse_order(llist_del_all(&ctx->work_llist));
>> -    while (node) {
>> -        struct llist_node *next = node->next;
>> -        struct io_kiocb *req = container_of(node, struct io_kiocb,
>> -                            io_task_work.node);
>> -        INDIRECT_CALL_2(req->io_task_work.func,
>> -                io_poll_task_func, io_req_rw_complete,
>> -                req, ts);
>> -        ret++;
>> -        node = next;
>> -    }
>> +    ret = __io_run_local_work_loop(&node, ts, ret);
> 
> One thing that is not so nice is that now we have this handling and
> checks in the hot path, and __io_run_local_work_loop() most likely
> gets uninlined.

I don't think that really matters, it's pretty light. The main overhead
in this function is not the call, it's reordering requests and touching
cachelines of the requests.

I think it's pretty light as-is and actually looks pretty good. It's
also similar to how sqpoll trips over longer task_work lines, and
arguably a mistake that we allow huge depths of this when we can avoid
it with deferred task_work.

> I wonder, can we just requeue it via task_work again? We can even
> add a variant efficiently adding a list instead of a single entry,
> i.e. local_task_work_add(head, tail, ...);

I think that can only work if we change work_llist to be a regular list
with regular locking. Otherwise it's a bit of a mess with the list being
reordered, and then you're spending extra cycles on potentially
reordering all the entries again.
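
For the record, the shape would be something like (hypothetical; the
work_lock/work_list fields don't exist today):

	spin_lock(&ctx->work_lock);
	list_splice_init(&leftover, &ctx->work_list);
	spin_unlock(&ctx->work_lock);

and then every io_req_local_work_add() has to take that lock as well,
adding contention on the add side.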

> I'm also curious what's the use case you've got that is hitting
> the problem?

I'll let David answer that one, but some task_work can take a while to
run, e.g. if it's not just posting a completion.

-- 
Jens Axboe


* Re: [PATCH next v1 0/2] limit local tw done
  2024-11-20 22:14 [PATCH next v1 0/2] limit local tw done David Wei
  2024-11-20 22:14 ` [PATCH next v1 1/2] io_uring: add io_local_work_pending() David Wei
  2024-11-20 22:14 ` [PATCH next v1 2/2] io_uring: limit local tw done David Wei
@ 2024-11-21  1:12 ` Jens Axboe
  2 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2024-11-21  1:12 UTC (permalink / raw)
  To: David Wei, io-uring; +Cc: Pavel Begunkov

On 11/20/24 3:14 PM, David Wei wrote:
> Currently, running local task work (tw) eagerly tries to complete all
> available work. When there is a burst of work coming in, the list of
> work in work_llist may be much larger than the requested batch count
> wait_nr. Doing all of the work eagerly may cause latency issues for
> applications that do not want to spend so much time in the kernel.
> 
> Limit the amount of local tw done to the max of 20 (a heuristic) or
> wait_nr. This does not require any userspace changes.
> 
> Many applications will submit and wait with wait_nr = 1 to mean "wait
> for _at least_ 1 completion, but do more work if available". This used
> to mean "all work", but it is now capped at 20 requests. If users want
> more work batching, they can set wait_nr > 20.

Looks good to me!

-- 
Jens Axboe

