* [PATCH next v1 0/2] limit local tw done @ 2024-11-20 22:14 David Wei 2024-11-20 22:14 ` [PATCH next v1 1/2] io_uring: add io_local_work_pending() David Wei ` (3 more replies) 0 siblings, 4 replies; 30+ messages in thread From: David Wei @ 2024-11-20 22:14 UTC (permalink / raw) To: io-uring; +Cc: David Wei, Jens Axboe, Pavel Begunkov Currently when running local tw it will eagerly try and complete all available work. When there is a burst of work coming in, the list of work in work_llist may be much larger than the requested batch count wait_nr. Doing all of the work eagerly may cause latency issues for some applications that do not want to spend so much time in the kernel. Limit the amount of local tw done to the max of 20 (a heuristic) or wait_nr. This then does not require any userspace changes. Many applications will submit and wait with wait_nr = 1 to mean "wait for _at least_ 1 completion, but do more work if available". This used to mean "all work" but now that is capped to 20 requests. If users want more work batching, then they can set wait_nr to > 20. David Wei (2): io_uring: add io_local_work_pending() io_uring: limit local tw done include/linux/io_uring_types.h | 1 + io_uring/io_uring.c | 57 +++++++++++++++++++++++----------- io_uring/io_uring.h | 9 ++++-- 3 files changed, 47 insertions(+), 20 deletions(-) -- 2.43.5 ^ permalink raw reply [flat|nested] 30+ messages in thread
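The capping rule in the cover letter — one pass over local task_work completes at most max(20, wait_nr) items — can be modelled in plain userspace C. This is an illustrative sketch, not the kernel code; the names merely echo the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative userspace model of the batching cap described above:
 * one pass over local tw completes at most
 * max(IO_LOCAL_TW_DEFAULT_MAX, wait_nr) items. Not kernel code. */
#define IO_LOCAL_TW_DEFAULT_MAX 20

static int tw_run_limit(int wait_nr)
{
	return wait_nr > IO_LOCAL_TW_DEFAULT_MAX ? wait_nr : IO_LOCAL_TW_DEFAULT_MAX;
}

/* How many of 'pending' queued items does a single pass complete? */
static int tw_run_pass(int pending, int wait_nr)
{
	int limit = tw_run_limit(wait_nr);

	return pending < limit ? pending : limit;
}
```

So an application waiting with wait_nr = 1 still gets up to 20 items batched per pass, while a wait_nr above 20 raises the cap accordingly, matching the "set wait_nr to > 20" guidance above.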
* [PATCH next v1 1/2] io_uring: add io_local_work_pending() 2024-11-20 22:14 [PATCH next v1 0/2] limit local tw done David Wei @ 2024-11-20 22:14 ` David Wei 2024-11-20 23:45 ` Pavel Begunkov 2024-11-20 22:14 ` [PATCH next v1 2/2] io_uring: limit local tw done David Wei ` (2 subsequent siblings) 3 siblings, 1 reply; 30+ messages in thread From: David Wei @ 2024-11-20 22:14 UTC (permalink / raw) To: io-uring; +Cc: David Wei, Jens Axboe, Pavel Begunkov In preparation for adding a new llist of tw to retry due to hitting the tw limit, add a helper io_local_work_pending(). This function returns true if there is any local tw pending. For now it only checks ctx->work_llist. Signed-off-by: David Wei <[email protected]> --- io_uring/io_uring.c | 14 +++++++------- io_uring/io_uring.h | 9 +++++++-- 2 files changed, 14 insertions(+), 9 deletions(-) diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 801293399883..83bf041d2648 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -1260,7 +1260,7 @@ static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx) static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events, int min_events) { - if (llist_empty(&ctx->work_llist)) + if (!io_local_work_pending(ctx)) return false; if (events < min_events) return true; @@ -1313,7 +1313,7 @@ static inline int io_run_local_work_locked(struct io_ring_ctx *ctx, { struct io_tw_state ts = {}; - if (llist_empty(&ctx->work_llist)) + if (!io_local_work_pending(ctx)) return 0; return __io_run_local_work(ctx, &ts, min_events); } @@ -2328,7 +2328,7 @@ static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode, int io_run_task_work_sig(struct io_ring_ctx *ctx) { - if (!llist_empty(&ctx->work_llist)) { + if (io_local_work_pending(ctx)) { __set_current_state(TASK_RUNNING); if (io_run_local_work(ctx, INT_MAX) > 0) return 0; @@ -2459,7 +2459,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx, { if 
(unlikely(READ_ONCE(ctx->check_cq))) return 1; - if (unlikely(!llist_empty(&ctx->work_llist))) + if (unlikely(io_local_work_pending(ctx))) return 1; if (unlikely(task_work_pending(current))) return 1; @@ -2493,7 +2493,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags, if (!io_allowed_run_tw(ctx)) return -EEXIST; - if (!llist_empty(&ctx->work_llist)) + if (io_local_work_pending(ctx)) io_run_local_work(ctx, min_events); io_run_task_work(); @@ -2564,7 +2564,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags, * If we got woken because of task_work being processed, run it * now rather than let the caller do another wait loop. */ - if (!llist_empty(&ctx->work_llist)) + if (io_local_work_pending(ctx)) io_run_local_work(ctx, nr_wait); io_run_task_work(); @@ -3158,7 +3158,7 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd) io_run_task_work(); io_uring_drop_tctx_refs(current); xa_for_each(&tctx->xa, index, node) { - if (!llist_empty(&node->ctx->work_llist)) { + if (io_local_work_pending(node->ctx)) { WARN_ON_ONCE(node->ctx->submitter_task && node->ctx->submitter_task != current); goto end_wait; diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h index 4070d4c8ef97..69eb3b23a5a0 100644 --- a/io_uring/io_uring.h +++ b/io_uring/io_uring.h @@ -347,9 +347,14 @@ static inline int io_run_task_work(void) return ret; } +static inline bool io_local_work_pending(struct io_ring_ctx *ctx) +{ + return !llist_empty(&ctx->work_llist); +} + static inline bool io_task_work_pending(struct io_ring_ctx *ctx) { - return task_work_pending(current) || !llist_empty(&ctx->work_llist); + return task_work_pending(current) || io_local_work_pending(ctx); } static inline void io_tw_lock(struct io_ring_ctx *ctx, struct io_tw_state *ts) @@ -484,6 +489,6 @@ enum { static inline bool io_has_work(struct io_ring_ctx *ctx) { return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq) || - !llist_empty(&ctx->work_llist); 
+ io_local_work_pending(ctx); } #endif -- 2.43.5 ^ permalink raw reply related [flat|nested] 30+ messages in thread
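The shape of this refactor — funnel scattered emptiness checks through one predicate so that a second list can later be folded into the check in a single place — can be sketched in userspace C. Types and names here are illustrative stand-ins, not the kernel's llist API:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the kernel's llist types. */
struct work_node { struct work_node *next; };
struct work_list { struct work_node *first; };

struct ring_ctx {
	struct work_list work_llist;
	/* a retry list can later be added here and checked below */
};

/* Callers test this predicate instead of the list directly, so
 * extending the definition of "pending" touches only this helper. */
static bool local_work_pending(const struct ring_ctx *ctx)
{
	return ctx->work_llist.first != NULL;
}
```

This is why patch 2/2 only has to touch the helper body, not the seven call sites converted here.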
* Re: [PATCH next v1 1/2] io_uring: add io_local_work_pending() 2024-11-20 22:14 ` [PATCH next v1 1/2] io_uring: add io_local_work_pending() David Wei @ 2024-11-20 23:45 ` Pavel Begunkov 0 siblings, 0 replies; 30+ messages in thread From: Pavel Begunkov @ 2024-11-20 23:45 UTC (permalink / raw) To: David Wei, io-uring; +Cc: Jens Axboe On 11/20/24 22:14, David Wei wrote: > In preparation for adding a new llist of tw to retry due to hitting the > tw limit, add a helper io_local_work_pending(). This function returns > true if there is any local tw pending. For now it only checks > ctx->work_llist. Looks clean, we can even take it separately from 2/2 Reviewed-by: Pavel Begunkov <[email protected]> -- Pavel Begunkov
* [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-20 22:14 [PATCH next v1 0/2] limit local tw done David Wei 2024-11-20 22:14 ` [PATCH next v1 1/2] io_uring: add io_local_work_pending() David Wei @ 2024-11-20 22:14 ` David Wei 2024-11-20 23:56 ` Pavel Begunkov 2024-11-21 1:12 ` [PATCH next v1 0/2] " Jens Axboe 2024-11-21 14:16 ` Jens Axboe 3 siblings, 1 reply; 30+ messages in thread From: David Wei @ 2024-11-20 22:14 UTC (permalink / raw) To: io-uring; +Cc: David Wei, Jens Axboe, Pavel Begunkov Instead of eagerly running all available local tw, limit the amount of local tw done to the max of IO_LOCAL_TW_DEFAULT_MAX (20) or wait_nr. The value of 20 is chosen as a reasonable heuristic to allow enough work batching but also keep latency down. Add a retry_llist that maintains a list of local tw that couldn't be done in time. No synchronisation is needed since it is only modified within the task context. Signed-off-by: David Wei <[email protected]> --- include/linux/io_uring_types.h | 1 + io_uring/io_uring.c | 43 +++++++++++++++++++++++++--------- io_uring/io_uring.h | 2 +- 3 files changed, 34 insertions(+), 12 deletions(-) diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h index 593c10a02144..011860ade268 100644 --- a/include/linux/io_uring_types.h +++ b/include/linux/io_uring_types.h @@ -336,6 +336,7 @@ struct io_ring_ctx { */ struct { struct llist_head work_llist; + struct llist_head retry_llist; unsigned long check_cq; atomic_t cq_wait_nr; atomic_t cq_timeouts; diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 83bf041d2648..c3a7d0197636 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -121,6 +121,7 @@ #define IO_COMPL_BATCH 32 #define IO_REQ_ALLOC_BATCH 8 +#define IO_LOCAL_TW_DEFAULT_MAX 20 struct io_defer_entry { struct list_head list; @@ -1255,6 +1256,8 @@ static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx) struct llist_node *node = llist_del_all(&ctx->work_llist); 
__io_fallback_tw(node, false); + node = llist_del_all(&ctx->retry_llist); + __io_fallback_tw(node, false); } static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events, @@ -1269,37 +1272,55 @@ static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events, return false; } +static int __io_run_local_work_loop(struct llist_node **node, + struct io_tw_state *ts, + int events) +{ + while (*node) { + struct llist_node *next = (*node)->next; + struct io_kiocb *req = container_of(*node, struct io_kiocb, + io_task_work.node); + INDIRECT_CALL_2(req->io_task_work.func, + io_poll_task_func, io_req_rw_complete, + req, ts); + *node = next; + if (--events <= 0) + break; + } + + return events; +} + static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts, int min_events) { struct llist_node *node; unsigned int loops = 0; - int ret = 0; + int ret, limit; if (WARN_ON_ONCE(ctx->submitter_task != current)) return -EEXIST; if (ctx->flags & IORING_SETUP_TASKRUN_FLAG) atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags); + limit = max(IO_LOCAL_TW_DEFAULT_MAX, min_events); again: + ret = __io_run_local_work_loop(&ctx->retry_llist.first, ts, limit); + if (ctx->retry_llist.first) + goto retry_done; + /* * llists are in reverse order, flip it back the right way before * running the pending items. 
*/ node = llist_reverse_order(llist_del_all(&ctx->work_llist)); - while (node) { - struct llist_node *next = node->next; - struct io_kiocb *req = container_of(node, struct io_kiocb, - io_task_work.node); - INDIRECT_CALL_2(req->io_task_work.func, - io_poll_task_func, io_req_rw_complete, - req, ts); - ret++; - node = next; - } + ret = __io_run_local_work_loop(&node, ts, ret); + ctx->retry_llist.first = node; loops++; + ret = limit - ret; if (io_run_local_work_continue(ctx, ret, min_events)) goto again; +retry_done: io_submit_flush_completions(ctx); if (io_run_local_work_continue(ctx, ret, min_events)) goto again; diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h index 69eb3b23a5a0..12abee607e4a 100644 --- a/io_uring/io_uring.h +++ b/io_uring/io_uring.h @@ -349,7 +349,7 @@ static inline int io_run_task_work(void) static inline bool io_local_work_pending(struct io_ring_ctx *ctx) { - return !llist_empty(&ctx->work_llist); + return !llist_empty(&ctx->work_llist) || !llist_empty(&ctx->retry_llist); } static inline bool io_task_work_pending(struct io_ring_ctx *ctx) -- 2.43.5 ^ permalink raw reply related [flat|nested] 30+ messages in thread
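The control flow of this patch — drain the retry list first, then the freshly reversed main list, stop once the limit is spent, and stash any leftovers for the next pass — can be modelled with plain pointers. A hedged userspace sketch follows; all names are illustrative, and the counter stands in for the INDIRECT_CALL_2() dispatch the real code performs per request:

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *next; int id; };

/* Pushes are LIFO, so the pending list is in reverse order; flip it. */
static struct node *list_reverse(struct node *n)
{
	struct node *prev = NULL;

	while (n) {
		struct node *next = n->next;

		n->next = prev;
		prev = n;
		n = next;
	}
	return prev;
}

/* Run up to 'events' items from *list; returns the unused budget. */
static int run_loop(struct node **list, int events, int *ran)
{
	while (*list) {
		struct node *next = (*list)->next;

		(*ran)++;		/* stand-in for INDIRECT_CALL_2(...) */
		*list = next;
		if (--events <= 0)
			break;
	}
	return events;
}

/* One pass: retry list first, then the main list; leftovers are parked
 * on the retry list, preserving FIFO order across passes. */
static int run_local_work(struct node **retry, struct node **main_list, int limit)
{
	int ran = 0;
	int budget = run_loop(retry, limit, &ran);

	if (*retry)
		return ran;	/* budget spent inside the retry list */
	*main_list = list_reverse(*main_list);
	run_loop(main_list, budget, &ran);
	*retry = *main_list;	/* anything left waits for the next pass */
	*main_list = NULL;
	return ran;
}
```

With 30 items pushed and a limit of 20, the first pass completes items 0-19 and parks 20-29 on the retry list, still in submission order; the second pass finishes them.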
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-20 22:14 ` [PATCH next v1 2/2] io_uring: limit local tw done David Wei @ 2024-11-20 23:56 ` Pavel Begunkov 2024-11-21 0:52 ` David Wei 2024-11-21 1:12 ` Jens Axboe 0 siblings, 2 replies; 30+ messages in thread From: Pavel Begunkov @ 2024-11-20 23:56 UTC (permalink / raw) To: David Wei, io-uring; +Cc: Jens Axboe On 11/20/24 22:14, David Wei wrote: > Instead of eagerly running all available local tw, limit the amount of > local tw done to the max of IO_LOCAL_TW_DEFAULT_MAX (20) or wait_nr. The > value of 20 is chosen as a reasonable heuristic to allow enough work > batching but also keep latency down. > > Add a retry_llist that maintains a list of local tw that couldn't be > done in time. No synchronisation is needed since it is only modified > within the task context. > > Signed-off-by: David Wei <[email protected]> > --- > include/linux/io_uring_types.h | 1 + > io_uring/io_uring.c | 43 +++++++++++++++++++++++++--------- > io_uring/io_uring.h | 2 +- > 3 files changed, 34 insertions(+), 12 deletions(-) > > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h > index 593c10a02144..011860ade268 100644 > --- a/include/linux/io_uring_types.h > +++ b/include/linux/io_uring_types.h > @@ -336,6 +336,7 @@ struct io_ring_ctx { > */ > struct { > struct llist_head work_llist; > + struct llist_head retry_llist; Fwiw, probably doesn't matter, but it doesn't even need to be atomic, it's queued and spliced while holding ->uring_lock, the pending check is also synchronised as there is only one possible task doing that. > unsigned long check_cq; > atomic_t cq_wait_nr; > atomic_t cq_timeouts; > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c > index 83bf041d2648..c3a7d0197636 100644 > --- a/io_uring/io_uring.c > +++ b/io_uring/io_uring.c > @@ -121,6 +121,7 @@ ... 
> static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts, > int min_events) > { > struct llist_node *node; > unsigned int loops = 0; > - int ret = 0; > + int ret, limit; > > if (WARN_ON_ONCE(ctx->submitter_task != current)) > return -EEXIST; > if (ctx->flags & IORING_SETUP_TASKRUN_FLAG) > atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags); > + limit = max(IO_LOCAL_TW_DEFAULT_MAX, min_events); > again: > + ret = __io_run_local_work_loop(&ctx->retry_llist.first, ts, limit); > + if (ctx->retry_llist.first) > + goto retry_done; > + > /* > * llists are in reverse order, flip it back the right way before > * running the pending items. > */ > node = llist_reverse_order(llist_del_all(&ctx->work_llist)); > - while (node) { > - struct llist_node *next = node->next; > - struct io_kiocb *req = container_of(node, struct io_kiocb, > - io_task_work.node); > - INDIRECT_CALL_2(req->io_task_work.func, > - io_poll_task_func, io_req_rw_complete, > - req, ts); > - ret++; > - node = next; > - } > + ret = __io_run_local_work_loop(&node, ts, ret); One thing that is not so nice is that now we have this handling and checks in the hot path, and __io_run_local_work_loop() most likely gets uninlined. I wonder, can we just requeue it via task_work again? We can even add a variant efficiently adding a list instead of a single entry, i.e. local_task_work_add(head, tail, ...); I'm also curious what's the use case you've got that is hitting the problem? -- Pavel Begunkov ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-20 23:56 ` Pavel Begunkov @ 2024-11-21 0:52 ` David Wei 2024-11-21 14:29 ` Pavel Begunkov 2024-11-21 1:12 ` Jens Axboe 1 sibling, 1 reply; 30+ messages in thread From: David Wei @ 2024-11-21 0:52 UTC (permalink / raw) To: Pavel Begunkov, io-uring; +Cc: Jens Axboe On 2024-11-20 15:56, Pavel Begunkov wrote: > On 11/20/24 22:14, David Wei wrote: >> Instead of eagerly running all available local tw, limit the amount of >> local tw done to the max of IO_LOCAL_TW_DEFAULT_MAX (20) or wait_nr. The >> value of 20 is chosen as a reasonable heuristic to allow enough work >> batching but also keep latency down. >> >> Add a retry_llist that maintains a list of local tw that couldn't be >> done in time. No synchronisation is needed since it is only modified >> within the task context. >> >> Signed-off-by: David Wei <[email protected]> >> --- >> include/linux/io_uring_types.h | 1 + >> io_uring/io_uring.c | 43 +++++++++++++++++++++++++--------- >> io_uring/io_uring.h | 2 +- >> 3 files changed, 34 insertions(+), 12 deletions(-) >> >> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h >> index 593c10a02144..011860ade268 100644 >> --- a/include/linux/io_uring_types.h >> +++ b/include/linux/io_uring_types.h >> @@ -336,6 +336,7 @@ struct io_ring_ctx { >> */ >> struct { >> struct llist_head work_llist; >> + struct llist_head retry_llist; > > Fwiw, probably doesn't matter, but it doesn't even need > to be atomic, it's queued and spliced while holding > ->uring_lock, the pending check is also synchronised as > there is only one possible task doing that. Yeah, it doesn't. Keeping it as a llist_head means being able to re-use helpers that take llist_head or llist_node. 
> >> unsigned long check_cq; >> atomic_t cq_wait_nr; >> atomic_t cq_timeouts; >> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c >> index 83bf041d2648..c3a7d0197636 100644 >> --- a/io_uring/io_uring.c >> +++ b/io_uring/io_uring.c >> @@ -121,6 +121,7 @@ > ... >> static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts, >> int min_events) >> { >> struct llist_node *node; >> unsigned int loops = 0; >> - int ret = 0; >> + int ret, limit; >> if (WARN_ON_ONCE(ctx->submitter_task != current)) >> return -EEXIST; >> if (ctx->flags & IORING_SETUP_TASKRUN_FLAG) >> atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags); >> + limit = max(IO_LOCAL_TW_DEFAULT_MAX, min_events); >> again: >> + ret = __io_run_local_work_loop(&ctx->retry_llist.first, ts, limit); >> + if (ctx->retry_llist.first) >> + goto retry_done; >> + >> /* >> * llists are in reverse order, flip it back the right way before >> * running the pending items. >> */ >> node = llist_reverse_order(llist_del_all(&ctx->work_llist)); >> - while (node) { >> - struct llist_node *next = node->next; >> - struct io_kiocb *req = container_of(node, struct io_kiocb, >> - io_task_work.node); >> - INDIRECT_CALL_2(req->io_task_work.func, >> - io_poll_task_func, io_req_rw_complete, >> - req, ts); >> - ret++; >> - node = next; >> - } >> + ret = __io_run_local_work_loop(&node, ts, ret); > > One thing that is not so nice is that now we have this handling and > checks in the hot path, and __io_run_local_work_loop() most likely > gets uninlined. > > I wonder, can we just requeue it via task_work again? We can even > add a variant efficiently adding a list instead of a single entry, > i.e. local_task_work_add(head, tail, ...); That was an early idea, but it means re-reversing the list and then atomically adding each node back to work_llist concurrently with e.g. io_req_local_work_add(). Using a separate retry_llist means we don't need to concurrently add to either retry_llist or work_llist. 
> > I'm also curious what's the use case you've got that is hitting > the problem? > There is a Memcache-like workload that has load shedding based on the time spent doing work. With epoll, the work of reading sockets and processing a request is done by the user, which can decide after some amount of time to drop the remaining work if it takes too long. With io_uring, the work of reading sockets is done eagerly inside of task work. If there is a burst of work, then so much time is spent in task work reading from sockets that, by the time control returns to the user, the timeout has already elapsed.
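The synchronisation point being made here can be sketched in C11: putting leftovers back on the shared work list would need an atomic push per node, because producers may be appending concurrently, whereas parking them on a task-private retry list is a single plain store. The names below are illustrative stand-ins, not the kernel's llist API:

```c
#include <stdatomic.h>
#include <stddef.h>

struct lnode { struct lnode *next; };

/* Shared MPSC list: producers push concurrently, so every re-queued
 * node would cost a CAS loop (model of requeueing via task_work). */
struct shared_list { _Atomic(struct lnode *) first; };

static void shared_push(struct shared_list *l, struct lnode *n)
{
	struct lnode *old = atomic_load(&l->first);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&l->first, &old, n));
}

/* Task-private retry list: only the submitter task touches it, so
 * stashing an entire leftover chain is one ordinary store. */
struct private_list { struct lnode *first; };

static void private_stash(struct private_list *l, struct lnode *head)
{
	l->first = head;
}
```

The sketch also shows why the pending list comes out in reverse order: the lock-free push is LIFO, which is what forces the llist_reverse_order() (and would force a re-reversal if leftovers were pushed back).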
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-21 0:52 ` David Wei @ 2024-11-21 14:29 ` Pavel Begunkov 2024-11-21 14:34 ` Jens Axboe 0 siblings, 1 reply; 30+ messages in thread From: Pavel Begunkov @ 2024-11-21 14:29 UTC (permalink / raw) To: David Wei, io-uring; +Cc: Jens Axboe On 11/21/24 00:52, David Wei wrote: > On 2024-11-20 15:56, Pavel Begunkov wrote: >> On 11/20/24 22:14, David Wei wrote: ... >> One thing that is not so nice is that now we have this handling and >> checks in the hot path, and __io_run_local_work_loop() most likely >> gets uninlined. >> >> I wonder, can we just requeue it via task_work again? We can even >> add a variant efficiently adding a list instead of a single entry, >> i.e. local_task_work_add(head, tail, ...); > > That was an early idea, but it means re-reversing the list and then > atomically adding each node back to work_llist concurrently with e.g. > io_req_local_work_add(). > > Using a separate retry_llist means we don't need to concurrently add to > either retry_llist or work_llist. > >> >> I'm also curious what's the use case you've got that is hitting >> the problem? >> > > There is a Memcache-like workload that has load shedding based on the > time spent doing work. With epoll, the work of reading sockets and > processing a request is done by user, which can decide after some amount > of time to drop the remaining work if it takes too long. With io_uring, > the work of reading sockets is done eagerly inside of task work. If > there is a burst of work, then so much time is spent in task work > reading from sockets that, by the time control returns to user the > timeout has already elapsed. Interesting, it also sounds like instead of an arbitrary 20 we might want the user to feed it to us. Might be easier to do it with the bpf toy not to carve another argument. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-21 14:29 ` Pavel Begunkov @ 2024-11-21 14:34 ` Jens Axboe 2024-11-21 14:58 ` Pavel Begunkov 0 siblings, 1 reply; 30+ messages in thread From: Jens Axboe @ 2024-11-21 14:34 UTC (permalink / raw) To: Pavel Begunkov, David Wei, io-uring On 11/21/24 7:29 AM, Pavel Begunkov wrote: > On 11/21/24 00:52, David Wei wrote: >> On 2024-11-20 15:56, Pavel Begunkov wrote: >>> On 11/20/24 22:14, David Wei wrote: > ... >>> One thing that is not so nice is that now we have this handling and >>> checks in the hot path, and __io_run_local_work_loop() most likely >>> gets uninlined. >>> >>> I wonder, can we just requeue it via task_work again? We can even >>> add a variant efficiently adding a list instead of a single entry, >>> i.e. local_task_work_add(head, tail, ...); >> >> That was an early idea, but it means re-reversing the list and then >> atomically adding each node back to work_llist concurrently with e.g. >> io_req_local_work_add(). >> >> Using a separate retry_llist means we don't need to concurrently add to >> either retry_llist or work_llist. >> >>> >>> I'm also curious what's the use case you've got that is hitting >>> the problem? >>> >> >> There is a Memcache-like workload that has load shedding based on the >> time spent doing work. With epoll, the work of reading sockets and >> processing a request is done by user, which can decide after some amount >> of time to drop the remaining work if it takes too long. With io_uring, >> the work of reading sockets is done eagerly inside of task work. If >> there is a burst of work, then so much time is spent in task work >> reading from sockets that, by the time control returns to user the >> timeout has already elapsed. > > Interesting, it also sounds like instead of an arbitrary 20 we > might want the user to feed it to us. Might be easier to do it > with the bpf toy not to carve another argument. 
David and I did discuss that, and I was not in favor of having an extra argument. We really just need some kind of limit to prevent it over-running. Arguably that should always be min_events, which we already have, but that kind of runs afoul of applications just doing io_uring_wait_cqe() and hence asking for 1. That's why the hand wavy number exists, which is really no different than other hand wavy numbers we have to limit running of "something" - eg other kinds of retries. Adding another argument to this just again doubles wait logic complexity in terms of the API. If it's needed down the line for whatever reason, then yeah we can certainly do it, probably via the wait regions. But adding it to the generic wait path would be a mistake imho. I also strongly suspect this is the last we'll ever hear of this, and for that reason alone I don't think it's worth any kind of extra arguments or added complexity. -- Jens Axboe
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-21 14:34 ` Jens Axboe @ 2024-11-21 14:58 ` Pavel Begunkov 2024-11-21 15:02 ` Jens Axboe 0 siblings, 1 reply; 30+ messages in thread From: Pavel Begunkov @ 2024-11-21 14:58 UTC (permalink / raw) To: Jens Axboe, David Wei, io-uring On 11/21/24 14:34, Jens Axboe wrote: > On 11/21/24 7:29 AM, Pavel Begunkov wrote: >> On 11/21/24 00:52, David Wei wrote: >>> On 2024-11-20 15:56, Pavel Begunkov wrote: >>>> On 11/20/24 22:14, David Wei wrote: ... >>> There is a Memcache-like workload that has load shedding based on the >>> time spent doing work. With epoll, the work of reading sockets and >>> processing a request is done by user, which can decide after some amount >>> of time to drop the remaining work if it takes too long. With io_uring, >>> the work of reading sockets is done eagerly inside of task work. If >>> there is a burst of work, then so much time is spent in task work >>> reading from sockets that, by the time control returns to user the >>> timeout has already elapsed. >> >> Interesting, it also sounds like instead of an arbitrary 20 we >> might want the user to feed it to us. Might be easier to do it >> with the bpf toy not to carve another argument. > > David and I did discuss that, and I was not in favor of having an extra > argument. We really just need some kind of limit to prevent it > over-running. Arguably that should always be min_events, which we > already have, but that kind of runs afoul of applications just doing > io_uring_wait_cqe() and hence asking for 1. That's why the hand wavy > number exists, which is really no different than other hand wavy numbers > we have to limit running of "something" - eg other kinds of retries. > > Adding another argument to this just again doubles wait logic complexity > in terms of the API. If it's needed down the line for whatever reason, > then yeah we can certainly do it, probably via the wait regions. 
But > adding it to the generic wait path would be a mistake imho. Right, I don't like the idea of a wait argument either, messy and too advanced of a tuning for it. BPF would be fine as it could be made as a hint and easily removed if needed. It could also be a per ring REGISTER based hint, a bit worse but with the same deprecation argument. Obviously it would need a good user / use case first. > I also strongly suspect this is the last we'll ever hear of this, and > for that reason alone I don't think it's worth any kind of extra > arguments or added complexity. -- Pavel Begunkov
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-21 14:58 ` Pavel Begunkov @ 2024-11-21 15:02 ` Jens Axboe 0 siblings, 0 replies; 30+ messages in thread From: Jens Axboe @ 2024-11-21 15:02 UTC (permalink / raw) To: Pavel Begunkov, David Wei, io-uring On 11/21/24 7:58 AM, Pavel Begunkov wrote: > On 11/21/24 14:34, Jens Axboe wrote: >> On 11/21/24 7:29 AM, Pavel Begunkov wrote: >>> On 11/21/24 00:52, David Wei wrote: >>>> On 2024-11-20 15:56, Pavel Begunkov wrote: >>>>> On 11/20/24 22:14, David Wei wrote: > ... >>>> There is a Memcache-like workload that has load shedding based on the >>>> time spent doing work. With epoll, the work of reading sockets and >>>> processing a request is done by user, which can decide after some amount >>>> of time to drop the remaining work if it takes too long. With io_uring, >>>> the work of reading sockets is done eagerly inside of task work. If >>>> there is a burst of work, then so much time is spent in task work >>>> reading from sockets that, by the time control returns to user the >>>> timeout has already elapsed. >>> >>> Interesting, it also sounds like instead of an arbitrary 20 we >>> might want the user to feed it to us. Might be easier to do it >>> with the bpf toy not to carve another argument. >> >> David and I did discuss that, and I was not in favor of having an extra >> argument. We really just need some kind of limit to prevent it >> over-running. Arguably that should always be min_events, which we >> already have, but that kind of runs afoul of applications just doing >> io_uring_wait_cqe() and hence asking for 1. That's why the hand wavy >> number exists, which is really no different than other hand wavy numbers >> we have to limit running of "something" - eg other kinds of retries. >> >> Adding another argument to this just again doubles wait logic complexity >> in terms of the API. If it's needed down the line for whatever reason, >> then yeah we can certainly do it, probably via the wait regions. 
But >> adding it to the generic wait path would be a mistake imho. > > Right, I don't like the idea of a wait argument either, messy and > too advanced of a tuning for it. BPF would be fine as it could be > made as a hint and easily removed if needed. It could also be a per > ring REGISTER based hint, a bit worse but with the same deprecation > argument. Obviously would need a good user / use case first. Sure I agree on that, if you already have BPF integration for other things, then yeah you could add tighter control for it. Even with that, I doubt it'd be useful or meaningful really. Current case seems like a worst case kind of thing, recv task_work is arguably one of the more expensive things that can be done out of task_work. We can revisit down the line if it ever becomes needed. -- Jens Axboe ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-20 23:56 ` Pavel Begunkov 2024-11-21 0:52 ` David Wei @ 2024-11-21 1:12 ` Jens Axboe 2024-11-21 14:25 ` Pavel Begunkov 1 sibling, 1 reply; 30+ messages in thread From: Jens Axboe @ 2024-11-21 1:12 UTC (permalink / raw) To: Pavel Begunkov, David Wei, io-uring On 11/20/24 4:56 PM, Pavel Begunkov wrote: > On 11/20/24 22:14, David Wei wrote: >> Instead of eagerly running all available local tw, limit the amount of >> local tw done to the max of IO_LOCAL_TW_DEFAULT_MAX (20) or wait_nr. The >> value of 20 is chosen as a reasonable heuristic to allow enough work >> batching but also keep latency down. >> >> Add a retry_llist that maintains a list of local tw that couldn't be >> done in time. No synchronisation is needed since it is only modified >> within the task context. >> >> Signed-off-by: David Wei <[email protected]> >> --- >> include/linux/io_uring_types.h | 1 + >> io_uring/io_uring.c | 43 +++++++++++++++++++++++++--------- >> io_uring/io_uring.h | 2 +- >> 3 files changed, 34 insertions(+), 12 deletions(-) >> >> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h >> index 593c10a02144..011860ade268 100644 >> --- a/include/linux/io_uring_types.h >> +++ b/include/linux/io_uring_types.h >> @@ -336,6 +336,7 @@ struct io_ring_ctx { >> */ >> struct { >> struct llist_head work_llist; >> + struct llist_head retry_llist; > > Fwiw, probably doesn't matter, but it doesn't even need > to be atomic, it's queued and spliced while holding > ->uring_lock, the pending check is also synchronised as > there is only one possible task doing that. > >> unsigned long check_cq; >> atomic_t cq_wait_nr; >> atomic_t cq_timeouts; >> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c >> index 83bf041d2648..c3a7d0197636 100644 >> --- a/io_uring/io_uring.c >> +++ b/io_uring/io_uring.c >> @@ -121,6 +121,7 @@ > ... 
>> static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts, >> int min_events) >> { >> struct llist_node *node; >> unsigned int loops = 0; >> - int ret = 0; >> + int ret, limit; >> if (WARN_ON_ONCE(ctx->submitter_task != current)) >> return -EEXIST; >> if (ctx->flags & IORING_SETUP_TASKRUN_FLAG) >> atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags); >> + limit = max(IO_LOCAL_TW_DEFAULT_MAX, min_events); >> again: >> + ret = __io_run_local_work_loop(&ctx->retry_llist.first, ts, limit); >> + if (ctx->retry_llist.first) >> + goto retry_done; >> + >> /* >> * llists are in reverse order, flip it back the right way before >> * running the pending items. >> */ >> node = llist_reverse_order(llist_del_all(&ctx->work_llist)); >> - while (node) { >> - struct llist_node *next = node->next; >> - struct io_kiocb *req = container_of(node, struct io_kiocb, >> - io_task_work.node); >> - INDIRECT_CALL_2(req->io_task_work.func, >> - io_poll_task_func, io_req_rw_complete, >> - req, ts); >> - ret++; >> - node = next; >> - } >> + ret = __io_run_local_work_loop(&node, ts, ret); > > One thing that is not so nice is that now we have this handling and > checks in the hot path, and __io_run_local_work_loop() most likely > gets uninlined. I don't think that really matters, it's pretty light. The main overhead in this function is not the call, it's reordering requests and touching cachelines of the requests. I think it's pretty light as-is and actually looks pretty good. It's also similar to how sqpoll bites over longer task_work lines, and arguably a mistake that we allow huge depths of this when we can avoid it with deferred task_work. > I wonder, can we just requeue it via task_work again? We can even > add a variant efficiently adding a list instead of a single entry, > i.e. local_task_work_add(head, tail, ...); I think that can only work if we change work_llist to be a regular list with regular locking. 
Otherwise it's a bit of a mess with the list being reordered, and then you're spending extra cycles on potentially reordering all the entries again. > I'm also curious what's the use case you've got that is hitting > the problem? I'll let David answer that one, but some task_work can take a while to run, eg if it's not just posting a completion. -- Jens Axboe ^ permalink raw reply [flat|nested] 30+ messages in thread
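[Editorial note] The reversing cost discussed above comes from llist being a lock-free LIFO stack: producers push onto the head, so the consumer gets entries in reverse arrival order and must flip the chain before running work FIFO. A minimal single-threaded userspace sketch of that pattern (names loosely borrowed from <linux/llist.h>; this is not the kernel implementation, which uses cmpxchg for the push):

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next;
	int id;
};

/* push to the front, as llist_add() does (no atomics in this sketch) */
static void stack_push(struct node **head, struct node *n)
{
	n->next = *head;
	*head = n;
}

/* the extra pass under discussion: llist_reverse_order() */
static struct node *reverse(struct node *head)
{
	struct node *prev = NULL;

	while (head) {
		struct node *next = head->next;

		head->next = prev;
		prev = head;
		head = next;
	}
	return prev;
}
```

Entries pushed 1, 2, 3 come off the stack as 3, 2, 1; the reverse pass restores submission order before the work is run.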
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-21 1:12 ` Jens Axboe @ 2024-11-21 14:25 ` Pavel Begunkov 2024-11-21 14:31 ` Jens Axboe 0 siblings, 1 reply; 30+ messages in thread From: Pavel Begunkov @ 2024-11-21 14:25 UTC (permalink / raw) To: Jens Axboe, David Wei, io-uring On 11/21/24 01:12, Jens Axboe wrote: > On 11/20/24 4:56 PM, Pavel Begunkov wrote: >> On 11/20/24 22:14, David Wei wrote: ... >> One thing that is not so nice is that now we have this handling and >> checks in the hot path, and __io_run_local_work_loop() most likely >> gets uninlined. > > I don't think that really matters, it's pretty light. The main overhead > in this function is not the call, it's reordering requests and touching > cachelines of the requests. > > I think it's pretty light as-is and actually looks pretty good. It's It could be light, but the question is importance / frequency of the new path. If it only happens rarely but affects a high 9, then it'd make more sense to optimise it from the common path. > also similar to how sqpoll bites over longer task_work lines, and > arguably a mistake that we allow huge depths of this when we can avoid > it with deferred task_work. > >> I wonder, can we just requeue it via task_work again? We can even >> add a variant efficiently adding a list instead of a single entry, >> i.e. local_task_work_add(head, tail, ...); > > I think that can only work if we change work_llist to be a regular list > with regular locking. Otherwise it's a bit of a mess with the list being Dylan once measured the overhead of locks vs atomics in this path for some artificial case, we can pull the numbers up. > reordered, and then you're spending extra cycles on potentially > reordering all the entries again. That sucks, I agree, but then it's the same question of how often it happens. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-21 14:25 ` Pavel Begunkov @ 2024-11-21 14:31 ` Jens Axboe 2024-11-21 15:07 ` Pavel Begunkov 0 siblings, 1 reply; 30+ messages in thread From: Jens Axboe @ 2024-11-21 14:31 UTC (permalink / raw) To: Pavel Begunkov, David Wei, io-uring On 11/21/24 7:25 AM, Pavel Begunkov wrote: > On 11/21/24 01:12, Jens Axboe wrote: >> On 11/20/24 4:56 PM, Pavel Begunkov wrote: >>> On 11/20/24 22:14, David Wei wrote: > ... >>> One thing that is not so nice is that now we have this handling and >>> checks in the hot path, and __io_run_local_work_loop() most likely >>> gets uninlined. >> >> I don't think that really matters, it's pretty light. The main overhead >> in this function is not the call, it's reordering requests and touching >> cachelines of the requests. >> >> I think it's pretty light as-is and actually looks pretty good. It's > > It could be light, but the question is importance / frequency of > the new path. If it only happens rarely but affects a high 9, > then it'd more sense to optimise it from the common path. I'm more worried about the outlier cases. We don't generally expect this to trigger very much obviously, if long chains of task_work was the norm then we'd have other reports/issues related to that. But the common overhead here is really just checking if another (same cacheline) pointer is non-NULL, and ditto on the run side. Really don't think that's anything to worry about. >> also similar to how sqpoll bites over longer task_work lines, and >> arguably a mistake that we allow huge depths of this when we can avoid >> it with deferred task_work. >> >>> I wonder, can we just requeue it via task_work again? We can even >>> add a variant efficiently adding a list instead of a single entry, >>> i.e. local_task_work_add(head, tail, ...); >> >> I think that can only work if we change work_llist to be a regular list >> with regular locking. 
Otherwise it's a bit of a mess with the list being > > Dylan once measured the overhead of locks vs atomics in this > path for some artificial case, we can pull the numbers up. I did it more recently if you'll remember, actually posted a patch I think a few months ago changing it to that. But even that approach adds extra overhead, if you want to add it to the same list as now you need to re-grab (and re-disable interrupts) the lock to add it back. My gut says that would be _worse_ than the current approach. And if you keep a separate list instead, well then you're back to identical overhead in terms of now needing to check both when needing to know if anything is pending, and checking both when running it. >> reordered, and then you're spending extra cycles on potentially >> reordering all the entries again. > > That sucks, I agree, but then it's same question of how often > it happens. At least for now, there's a real issue reported and we should fix it. I think the current patches are fine in that regard. That doesn't mean we can't potentially make it better, we should certainly investigate that. But I don't see the current patches as being suboptimal really, they are definitely good enough as-is for solving the issue. -- Jens Axboe ^ permalink raw reply [flat|nested] 30+ messages in thread
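[Editorial note] The "check both lists" overhead Jens describes is the core of patch 2/2: run at most a fixed number of entries, park the leftovers on a retry list, and drain that list first on the next run so FIFO order is preserved. A hedged userspace sketch of that shape (hypothetical `struct work`/`struct wctx` types standing in for the io_uring structs, not the actual kernel code):

```c
#include <assert.h>
#include <stddef.h>

struct work {
	struct work *next;
	void (*fn)(struct work *);
};

struct wctx {
	struct work *pending;	/* stands in for ctx->work_llist (already FIFO here) */
	struct work *retry;	/* stands in for ctx->retry_llist */
};

static int done_count;

static void do_work(struct work *w)
{
	(void)w;
	done_count++;
}

/* run at most `limit` entries, draining the retry list before new work;
 * leftovers are stashed for the next invocation */
static int run_capped(struct wctx *ctx, int limit)
{
	struct work **src = ctx->retry ? &ctx->retry : &ctx->pending;
	int ran = 0;

	while (*src && ran < limit) {
		struct work *w = *src;

		*src = w->next;
		w->fn(w);
		ran++;
		/* retry list drained: fall through to the pending list */
		if (!*src && src == &ctx->retry)
			src = &ctx->pending;
	}
	/* cap hit mid-pending: park the remainder on the retry list */
	if (*src && src == &ctx->pending) {
		ctx->retry = ctx->pending;
		ctx->pending = NULL;
	}
	return ran;
}

/* like io_local_work_pending(): work remains if either list is non-empty */
static int pending(struct wctx *ctx)
{
	return ctx->retry != NULL || ctx->pending != NULL;
}
```

With five queued items and a cap of three, the first call runs three and leaves two on the retry list; the second call drains them, in the original order.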
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-21 14:31 ` Jens Axboe @ 2024-11-21 15:07 ` Pavel Begunkov 2024-11-21 15:15 ` Jens Axboe 2024-11-21 17:53 ` David Wei 0 siblings, 2 replies; 30+ messages in thread From: Pavel Begunkov @ 2024-11-21 15:07 UTC (permalink / raw) To: Jens Axboe, David Wei, io-uring On 11/21/24 14:31, Jens Axboe wrote: > On 11/21/24 7:25 AM, Pavel Begunkov wrote: >> On 11/21/24 01:12, Jens Axboe wrote: >>> On 11/20/24 4:56 PM, Pavel Begunkov wrote: >>>> On 11/20/24 22:14, David Wei wrote: ... >>> I think that can only work if we change work_llist to be a regular list >>> with regular locking. Otherwise it's a bit of a mess with the list being >> >> Dylan once measured the overhead of locks vs atomics in this >> path for some artificial case, we can pull the numbers up. > > I did it more recently if you'll remember, actually posted a patch I > think a few months ago changing it to that. But even that approach adds Right, and it'd be a separate topic from this set. > extra overhead, if you want to add it to the same list as now you need Extra overhead to the retry path, which is not the hot path, and the coldness of it is uncertain. > to re-grab (and re-disable interrupts) the lock to add it back. My gut > says that would be _worse_ than the current approach. And if you keep a > separate list instead, well then you're back to identical overhead in > terms of now needing to check both when needing to know if anything is > pending, and checking both when running it. > >>> reordered, and then you're spending extra cycles on potentially >>> reordering all the entries again. >> >> That sucks, I agree, but then it's same question of how often >> it happens. > > At least for now, there's a real issue reported and we should fix it. I > think the current patches are fine in that regard. That doesn't mean we > can't potentially make it better, we should certainly investigate that.
> But I don't see the current patches as being suboptimal really, they are > definitely good enough as-is for solving the issue. That's fair enough, but I still would love to know how frequent it is. There is no purpose in optimising it as a hot/slow path if it triggers every fifth run or such. David, how easy is it to get some stats? We can hack up some bpftrace script -- Pavel Begunkov ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-21 15:07 ` Pavel Begunkov @ 2024-11-21 15:15 ` Jens Axboe 2024-11-21 15:22 ` Jens Axboe 2024-11-21 17:53 ` David Wei 1 sibling, 1 reply; 30+ messages in thread From: Jens Axboe @ 2024-11-21 15:15 UTC (permalink / raw) To: Pavel Begunkov, David Wei, io-uring On 11/21/24 8:07 AM, Pavel Begunkov wrote: >> At least for now, there's a real issue reported and we should fix it. I >> think the current patches are fine in that regard. That doesn't mean we >> can't potentially make it better, we should certainly investigate that. >> But I don't see the current patches as being suboptimal really, they are >> definitely good enough as-is for solving the issue. > > That's fair enough, but I still would love to know how frequent > it is. There is no purpose in optimising it as hot/slow path if > it triggers every fifth run or such. David, how easy it is to > get some stats? We can hack up some bpftrace script As mentioned, I don't think it's a frequent occurrence in the sense that it happens all the time, but it's one of those "if it hits, we're screwed" cases as now we're way over budget. And we have zero limiting in place to prevent it from happening. As such it doesn't really matter how frequent it is, all that matters is that it cannot happen. In terms of a different approach, eg "we can tolerate a bit more overhead for the overrun case happening", then yeah that'd be interesting as it may guide improvements. But the frequency of it occurring for one case doesn't mean that it won't be a "hits 50% of the time" for some other case. Living with a separate retry list (which is lockless, after all) seems to me to be the least of all evils, as the overhead is both limited and constant in all cases. I'd rather entertain NOT using llists for this in the first place, as it gets rid of the reversing which is the main cost here.
That won't change the need for a retry list necessarily, as I think we'd be better off with a lockless retry list still. But at least it'd get rid of the reversing. Let me see if I can dig out that patch... Totally orthogonal to this topic, obviously. -- Jens Axboe ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-21 15:15 ` Jens Axboe @ 2024-11-21 15:22 ` Jens Axboe 2024-11-21 16:00 ` Pavel Begunkov 2024-11-21 16:18 ` Pavel Begunkov 0 siblings, 2 replies; 30+ messages in thread From: Jens Axboe @ 2024-11-21 15:22 UTC (permalink / raw) To: Pavel Begunkov, David Wei, io-uring On 11/21/24 8:15 AM, Jens Axboe wrote: > I'd rather entertain NOT using llists for this in the first place, as it > gets rid of the reversing which is the main cost here. That won't change > the need for a retry list necessarily, as I think we'd be better off > with a lockless retry list still. But at least it'd get rid of the > reversing. Let me see if I can dig out that patch... Totally orthogonal > to this topic, obviously. It's here: https://lore.kernel.org/io-uring/[email protected]/ I did improve it further but never posted it again, fwiw. -- Jens Axboe ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-21 15:22 ` Jens Axboe @ 2024-11-21 16:00 ` Pavel Begunkov 2024-11-21 16:05 ` Jens Axboe 0 siblings, 1 reply; 30+ messages in thread From: Pavel Begunkov @ 2024-11-21 16:00 UTC (permalink / raw) To: Jens Axboe, David Wei, io-uring On 11/21/24 15:22, Jens Axboe wrote: > On 11/21/24 8:15 AM, Jens Axboe wrote: >> I'd rather entertain NOT using llists for this in the first place, as it >> gets rid of the reversing which is the main cost here. That won't change >> the need for a retry list necessarily, as I think we'd be better off >> with a lockless retry list still. But at least it'd get rid of the >> reversing. Let me see if I can dig out that patch... Totally orthogonal >> to this topic, obviously. > > It's here: > > https://lore.kernel.org/io-uring/[email protected]/ > > I did improve it further but never posted it again, fwiw. It's nice that with sth like that we're not restricted by space and can be smarter about batching, e.g. splitting nr_tw into buckets. However, the overhead of the spinlock could be very high if there is contention. With block it's more uniform which CPU tw comes from, but with network it could be much more random. That's what Dylan measured back then, and quite a similar situation that you've seen yourself before is with socket locks. Another option is to try out how a lockless list (instead of stack) with double cmpxchg would perform. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-21 16:00 ` Pavel Begunkov @ 2024-11-21 16:05 ` Jens Axboe 0 siblings, 0 replies; 30+ messages in thread From: Jens Axboe @ 2024-11-21 16:05 UTC (permalink / raw) To: Pavel Begunkov, David Wei, io-uring On 11/21/24 9:00 AM, Pavel Begunkov wrote: > On 11/21/24 15:22, Jens Axboe wrote: >> On 11/21/24 8:15 AM, Jens Axboe wrote: >>> I'd rather entertain NOT using llists for this in the first place, as it >>> gets rid of the reversing which is the main cost here. That won't change >>> the need for a retry list necessarily, as I think we'd be better off >>> with a lockless retry list still. But at least it'd get rid of the >>> reversing. Let me see if I can dig out that patch... Totally orthogonal >>> to this topic, obviously. >> >> It's here: >> >> https://lore.kernel.org/io-uring/[email protected]/ >> >> I did improve it further but never posted it again, fwiw. > It's nice that with sth like that we're not restricted by space and be > smarter about batching, e.g. splitting nr_tw into buckets. However, the > overhead of spinlock could be very hard if there is contention. With This is true, but it's also equally true for the llist - if you have contention on adding vs running, then you'll be bouncing the cacheline regardless. With the spinlock, you also have the added overhead of the IRQ disabling. > block it's more uniform which CPU tw comes from, but with network it > could be much more random. That's what Dylan measured back than, and > quite a similar situation that you've seen yourself before is with > socket locks. I'm sure he found the llist to be preferable, but that was also before we had to add the reversing. So not so clear cut anymore, may push it over the edge as I bet there was not much of a difference before. At least when I benchmarked this back in March, it wasn't like llist was a clear winner. 
> Another option is to try out how a lockless list (instead of stack) > with double cmpxchg would perform. I did ponder that and even talked to Paul about it as well, but I never benchmarked that one. I can try and resurrect that effort. One annoyance there is that we need arch support, but I don't think it's a big deal as the alternative is just to fallback to spinlock + normal list if it's not available. -- Jens Axboe ^ permalink raw reply [flat|nested] 30+ messages in thread
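[Editorial note] The "double cmpxchg" being discussed means keeping a FIFO's head and tail pointers consistent by updating both in a single wide compare-exchange. A kernel version would need a true double-pointer-width cmpxchg (cmpxchg16b on x86-64, with a spinlock fallback on architectures that lack it, as Jens notes). To stay portable, this toy sketch packs two 32-bit node indices into one 64-bit word; node linking is omitted, so it only shows the CAS shape, not a complete queue and not the proposed kernel code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdatomic.h>

#define NIL 0xffffffffu

struct fifo {
	_Atomic uint64_t ht;	/* high 32 bits: head index, low 32 bits: tail */
};

static uint64_t pack(uint32_t head, uint32_t tail)
{
	return (uint64_t)head << 32 | tail;
}

static void fifo_init(struct fifo *f)
{
	atomic_store(&f->ht, pack(NIL, NIL));
}

static uint32_t fifo_head(struct fifo *f)
{
	return (uint32_t)(atomic_load(&f->ht) >> 32);
}

static uint32_t fifo_tail(struct fifo *f)
{
	return (uint32_t)atomic_load(&f->ht);
}

static void fifo_push(struct fifo *f, uint32_t idx)
{
	uint64_t old = atomic_load(&f->ht);
	uint64_t new;

	do {
		uint32_t head = (uint32_t)(old >> 32);

		/* empty -> both fields change; the pair-wide CAS keeps a
		 * concurrent pusher from tearing head and tail apart */
		if (head == NIL)
			new = pack(idx, idx);
		else
			new = pack(head, idx);
	} while (!atomic_compare_exchange_weak(&f->ht, &old, new));
}
```

The interesting cases are the empty-to-nonempty and nonempty-to-empty transitions, where head and tail must change together; that is what makes a single-word CAS insufficient for a FIFO and motivates the double-width variant.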
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-21 15:22 ` Jens Axboe 2024-11-21 16:00 ` Pavel Begunkov @ 2024-11-21 16:18 ` Pavel Begunkov 2024-11-21 16:20 ` Jens Axboe 1 sibling, 1 reply; 30+ messages in thread From: Pavel Begunkov @ 2024-11-21 16:18 UTC (permalink / raw) To: Jens Axboe, David Wei, io-uring On 11/21/24 15:22, Jens Axboe wrote: > On 11/21/24 8:15 AM, Jens Axboe wrote: >> I'd rather entertain NOT using llists for this in the first place, as it >> gets rid of the reversing which is the main cost here. That won't change >> the need for a retry list necessarily, as I think we'd be better off >> with a lockless retry list still. But at least it'd get rid of the >> reversing. Let me see if I can dig out that patch... Totally orthogonal >> to this topic, obviously. > > It's here: > > https://lore.kernel.org/io-uring/[email protected]/ > > I did improve it further but never posted it again, fwiw. io_req_local_work_add() needs a smp_mb() after unlock, see comments, release/unlock doesn't do it. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-21 16:18 ` Pavel Begunkov @ 2024-11-21 16:20 ` Jens Axboe 2024-11-21 16:43 ` Pavel Begunkov 0 siblings, 1 reply; 30+ messages in thread From: Jens Axboe @ 2024-11-21 16:20 UTC (permalink / raw) To: Pavel Begunkov, David Wei, io-uring On 11/21/24 9:18 AM, Pavel Begunkov wrote: > On 11/21/24 15:22, Jens Axboe wrote: >> On 11/21/24 8:15 AM, Jens Axboe wrote: >>> I'd rather entertain NOT using llists for this in the first place, as it >>> gets rid of the reversing which is the main cost here. That won't change >>> the need for a retry list necessarily, as I think we'd be better off >>> with a lockless retry list still. But at least it'd get rid of the >>> reversing. Let me see if I can dig out that patch... Totally orthogonal >>> to this topic, obviously. >> >> It's here: >> >> https://lore.kernel.org/io-uring/[email protected]/ >> >> I did improve it further but never posted it again, fwiw. > > io_req_local_work_add() needs a smp_mb() after unlock, see comments, > release/unlock doesn't do it. Yep, current version I have adds a smp_mb__after_unlock_lock() for that. Will do some quick testing, but then also try the double cmpxchg on top of that if supported. -- Jens Axboe ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-21 16:20 ` Jens Axboe @ 2024-11-21 16:43 ` Pavel Begunkov 2024-11-21 16:57 ` Jens Axboe 0 siblings, 1 reply; 30+ messages in thread From: Pavel Begunkov @ 2024-11-21 16:43 UTC (permalink / raw) To: Jens Axboe, David Wei, io-uring On 11/21/24 16:20, Jens Axboe wrote: > On 11/21/24 9:18 AM, Pavel Begunkov wrote: >> On 11/21/24 15:22, Jens Axboe wrote: >>> On 11/21/24 8:15 AM, Jens Axboe wrote: >>>> I'd rather entertain NOT using llists for this in the first place, as it >>>> gets rid of the reversing which is the main cost here. That won't change >>>> the need for a retry list necessarily, as I think we'd be better off >>>> with a lockless retry list still. But at least it'd get rid of the >>>> reversing. Let me see if I can dig out that patch... Totally orthogonal >>>> to this topic, obviously. >>> >>> It's here: >>> >>> https://lore.kernel.org/io-uring/[email protected]/ >>> >>> I did improve it further but never posted it again, fwiw. >> >> io_req_local_work_add() needs a smp_mb() after unlock, see comments, >> release/unlock doesn't do it. > > Yep, current version I have adds a smp_mb__after_unlock_lock() for that. I don't think it'd be correct. unlock_lock AFAIK is specifically for unlock + lock, you have lock + unlock. And data you want to synchronise is modified after the lock part. That'd need upgrading the release semantics implied by the unlock to a full barrier. I doubt there is a good way to optimise it. I doubt it'd give you anything even if you replace store_release in spin_unlock with xchg() and ignore the return, but you can probably ask Paul. > Will do some quick testing, but then also try the double cmpxchg on top > of that if supported. > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 30+ messages in thread
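[Editorial note] The ordering Pavel describes is a store->load case: the store that queues work must be ordered before the later load of cq_wait_nr, and an unlock's release semantics only order *prior* accesses before the unlock, not against subsequent loads. Hence the full smp_mb(), pairing with the barrier in set_current_state() on the waiter side. A userspace C11 sketch of the producer-side shape (stand-in atomics, not the kernel primitives; the seq_cst fence plays the role of smp_mb()):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int work_count;	/* stands in for the locked work list */
static atomic_int cq_wait_nr;	/* completions the waiting task wants */

/* producer: queue work, then decide whether to wake the waiter */
static bool add_work_and_check_wake(void)
{
	int nr_tw;

	/* in the real code this increment happens under the work lock */
	nr_tw = atomic_fetch_add_explicit(&work_count, 1,
					  memory_order_relaxed) + 1;
	/* the unlock's release ordering alone would NOT stop the CPU
	 * from hoisting the cq_wait_nr load above the queueing store;
	 * a full barrier (smp_mb()) is required */
	atomic_thread_fence(memory_order_seq_cst);
	return nr_tw >= atomic_load_explicit(&cq_wait_nr,
					     memory_order_relaxed);
}
```

Without the full barrier, the producer could read a stale cq_wait_nr and skip the wakeup while the waiter simultaneously misses the newly queued work, leaving both sides asleep.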
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-21 16:43 ` Pavel Begunkov @ 2024-11-21 16:57 ` Jens Axboe 2024-11-21 17:05 ` Jens Axboe 0 siblings, 1 reply; 30+ messages in thread From: Jens Axboe @ 2024-11-21 16:57 UTC (permalink / raw) To: Pavel Begunkov, David Wei, io-uring On 11/21/24 9:43 AM, Pavel Begunkov wrote: > On 11/21/24 16:20, Jens Axboe wrote: >> On 11/21/24 9:18 AM, Pavel Begunkov wrote: >>> On 11/21/24 15:22, Jens Axboe wrote: >>>> On 11/21/24 8:15 AM, Jens Axboe wrote: >>>>> I'd rather entertain NOT using llists for this in the first place, as it >>>>> gets rid of the reversing which is the main cost here. That won't change >>>>> the need for a retry list necessarily, as I think we'd be better off >>>>> with a lockless retry list still. But at least it'd get rid of the >>>>> reversing. Let me see if I can dig out that patch... Totally orthogonal >>>>> to this topic, obviously. >>>> >>>> It's here: >>>> >>>> https://lore.kernel.org/io-uring/[email protected]/ >>>> >>>> I did improve it further but never posted it again, fwiw. >>> >>> io_req_local_work_add() needs a smp_mb() after unlock, see comments, >>> release/unlock doesn't do it. >> >> Yep, current version I have adds a smp_mb__after_unlock_lock() for that. > > I don't think it'd be correct. unlock_lock AFAIK is specifically > for unlock + lock, you have lock + unlock. And data you want to > synchronise is modified after the lock part. That'd need upgrading > the release semantics implied by the unlock to a full barrier. > > I doubt there is a good way to optimise it. I doubt it'd give you > anything even if you replace store_release in spin_unlock with xchg() > and ignore the return, but you can probably ask Paul. True, will just make it an smp_mb(), should be fine. >> Will do some quick testing, but then also try the double cmpxchg on top >> of that if supported. This is a bit trickier, as we're not just updating the list first/last entries... 
Not sure I see a good way to do this. Maybe I'm missing something. I did run a basic IRQ storage test as-is, and will compare that with the llist stuff we have now. Just in terms of overhead. It's not quite a networking test, but you do get the IRQ side and some burstiness in terms of completions that way too, at high rates. So should be roughly comparable. -- Jens Axboe ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next v1 2/2] io_uring: limit local tw done 2024-11-21 16:57 ` Jens Axboe @ 2024-11-21 17:05 ` Jens Axboe 2024-11-22 17:01 ` Pavel Begunkov 0 siblings, 1 reply; 30+ messages in thread From: Jens Axboe @ 2024-11-21 17:05 UTC (permalink / raw) To: Pavel Begunkov, David Wei, io-uring [-- Attachment #1: Type: text/plain, Size: 701 bytes --] On 11/21/24 9:57 AM, Jens Axboe wrote: > I did run a basic IRQ storage test as-is, and will compare that with the > llist stuff we have now. Just in terms of overhead. It's not quite a > networking test, but you do get the IRQ side and some burstiness in > terms of completions that way too, at high rates. So should be roughly > comparable. Perf looks comparable, it's about 60M IOPS. Some fluctuation with IRQ driven, so won't render an opinion on whether one is faster than the other. What is visible though is that adding and running local task_work drops from 2.39% to 2.02% using spinlock + io_wq_work_list over llist, and we entirely drop 2.2% of list reversing in the process. -- Jens Axboe [-- Attachment #2: 0002-io_uring-replace-defer-task_work-llist-with-io_wq_wo.patch --] [-- Type: text/x-patch, Size: 11165 bytes --] From 3690e69ed6c7a4115acc56735dce4434b1105cc3 Mon Sep 17 00:00:00 2001 From: Jens Axboe <[email protected]> Date: Thu, 21 Nov 2024 09:18:19 -0700 Subject: [PATCH 2/2] io_uring: replace defer task_work llist with io_wq_work_list Add a spinlock for the list, and replace the lockless llist with the work list instead. This avoids needing to reverse items in the list before running them, as the io_wq_work_list is FIFO by nature whereas the llist is LIFO. 
Signed-off-by: Jens Axboe <[email protected]> --- include/linux/io_uring_types.h | 12 +- io_uring/io_uring.c | 196 ++++++++++++++++----------------- io_uring/io_uring.h | 2 +- 3 files changed, 101 insertions(+), 109 deletions(-) diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h index 011860ade268..03a102ca8105 100644 --- a/include/linux/io_uring_types.h +++ b/include/linux/io_uring_types.h @@ -335,8 +335,9 @@ struct io_ring_ctx { * regularly bounce b/w CPUs. */ struct { - struct llist_head work_llist; - struct llist_head retry_llist; + struct io_wq_work_list work_list; + spinlock_t work_lock; + int work_items; unsigned long check_cq; atomic_t cq_wait_nr; atomic_t cq_timeouts; @@ -566,7 +567,10 @@ enum { typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts); struct io_task_work { - struct llist_node node; + union { + struct io_wq_work_node work_node; + struct llist_node node; + }; io_req_tw_func_t func; }; @@ -622,8 +626,6 @@ struct io_kiocb { */ u16 buf_index; - unsigned nr_tw; - /* REQ_F_* flags */ io_req_flags_t flags; diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index c3a7d0197636..07e63e532797 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -338,9 +338,10 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) INIT_LIST_HEAD(&ctx->io_buffers_comp); INIT_LIST_HEAD(&ctx->defer_list); INIT_LIST_HEAD(&ctx->timeout_list); - INIT_LIST_HEAD(&ctx->ltimeout_list); - init_llist_head(&ctx->work_llist); INIT_LIST_HEAD(&ctx->tctx_list); + INIT_LIST_HEAD(&ctx->ltimeout_list); + INIT_WQ_LIST(&ctx->work_list); + spin_lock_init(&ctx->work_lock); ctx->submit_state.free_list.next = NULL; INIT_HLIST_HEAD(&ctx->waitid_list); #ifdef CONFIG_FUTEX @@ -1066,25 +1067,32 @@ struct llist_node *io_handle_tw_list(struct llist_node *node, return node; } -static __cold void __io_fallback_tw(struct llist_node *node, bool sync) +static __cold void __io_fallback_tw(struct io_kiocb *req, bool 
sync, + struct io_ring_ctx **last_ctx) +{ + if (sync && *last_ctx != req->ctx) { + if (*last_ctx) { + flush_delayed_work(&(*last_ctx)->fallback_work); + percpu_ref_put(&(*last_ctx)->refs); + } + *last_ctx = req->ctx; + percpu_ref_get(&(*last_ctx)->refs); + } + if (llist_add(&req->io_task_work.node, &req->ctx->fallback_llist)) + schedule_delayed_work(&req->ctx->fallback_work, 1); + +} + +static void io_fallback_tw(struct io_uring_task *tctx, bool sync) { + struct llist_node *node = llist_del_all(&tctx->task_list); struct io_ring_ctx *last_ctx = NULL; struct io_kiocb *req; while (node) { req = container_of(node, struct io_kiocb, io_task_work.node); node = node->next; - if (sync && last_ctx != req->ctx) { - if (last_ctx) { - flush_delayed_work(&last_ctx->fallback_work); - percpu_ref_put(&last_ctx->refs); - } - last_ctx = req->ctx; - percpu_ref_get(&last_ctx->refs); - } - if (llist_add(&req->io_task_work.node, - &req->ctx->fallback_llist)) - schedule_delayed_work(&req->ctx->fallback_work, 1); + __io_fallback_tw(req, sync, &last_ctx); } if (last_ctx) { @@ -1093,13 +1101,6 @@ static __cold void __io_fallback_tw(struct llist_node *node, bool sync) } } -static void io_fallback_tw(struct io_uring_task *tctx, bool sync) -{ - struct llist_node *node = llist_del_all(&tctx->task_list); - - __io_fallback_tw(node, sync); -} - struct llist_node *tctx_task_work_run(struct io_uring_task *tctx, unsigned int max_entries, unsigned int *count) @@ -1139,73 +1140,45 @@ void tctx_task_work(struct callback_head *cb) static inline void io_req_local_work_add(struct io_kiocb *req, struct io_ring_ctx *ctx, - unsigned flags) + unsigned tw_flags) { - unsigned nr_wait, nr_tw, nr_tw_prev; - struct llist_node *head; + unsigned long flags; + unsigned nr_tw; /* See comment above IO_CQ_WAKE_INIT */ BUILD_BUG_ON(IO_CQ_WAKE_FORCE <= IORING_MAX_CQ_ENTRIES); /* - * We don't know how many reuqests is there in the link and whether - * they can even be queued lazily, fall back to non-lazy. 
+ * We don't know how many requests are in the link and whether they can + * even be queued lazily, fall back to non-lazy. */ if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) - flags &= ~IOU_F_TWQ_LAZY_WAKE; + tw_flags &= ~IOU_F_TWQ_LAZY_WAKE; - guard(rcu)(); - - head = READ_ONCE(ctx->work_llist.first); - do { - nr_tw_prev = 0; - if (head) { - struct io_kiocb *first_req = container_of(head, - struct io_kiocb, - io_task_work.node); - /* - * Might be executed at any moment, rely on - * SLAB_TYPESAFE_BY_RCU to keep it alive. - */ - nr_tw_prev = READ_ONCE(first_req->nr_tw); - } - - /* - * Theoretically, it can overflow, but that's fine as one of - * previous adds should've tried to wake the task. - */ - nr_tw = nr_tw_prev + 1; - if (!(flags & IOU_F_TWQ_LAZY_WAKE)) - nr_tw = IO_CQ_WAKE_FORCE; - - req->nr_tw = nr_tw; - req->io_task_work.node.next = head; - } while (!try_cmpxchg(&ctx->work_llist.first, &head, - &req->io_task_work.node)); + spin_lock_irqsave(&ctx->work_lock, flags); + wq_list_add_tail(&req->io_task_work.work_node, &ctx->work_list); + nr_tw = ++ctx->work_items; + if (!(tw_flags & IOU_F_TWQ_LAZY_WAKE)) + nr_tw = IO_CQ_WAKE_FORCE; + spin_unlock_irqrestore(&ctx->work_lock, flags); /* - * cmpxchg implies a full barrier, which pairs with the barrier + * We need a barrier after unlock, which pairs with the barrier * in set_current_state() on the io_cqring_wait() side. It's used * to ensure that either we see updated ->cq_wait_nr, or waiters * going to sleep will observe the work added to the list, which * is similar to the wait/wawke task state sync. 
 */
-
-	if (!head) {
+	smp_mb();
+	if (nr_tw == 1) {
 		if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
 			atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
 		if (ctx->has_evfd)
 			io_eventfd_signal(ctx);
 	}
 
-	nr_wait = atomic_read(&ctx->cq_wait_nr);
-	/* not enough or no one is waiting */
-	if (nr_tw < nr_wait)
-		return;
-	/* the previous add has already woken it up */
-	if (nr_tw_prev >= nr_wait)
-		return;
-	wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
+	if (nr_tw >= atomic_read(&ctx->cq_wait_nr))
+		wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
 }
 
 static void io_req_normal_work_add(struct io_kiocb *req)
@@ -1253,11 +1226,27 @@ void io_req_task_work_add_remote(struct io_kiocb *req, struct io_ring_ctx *ctx,
 
 static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
 {
-	struct llist_node *node = llist_del_all(&ctx->work_llist);
+	struct io_ring_ctx *last_ctx = NULL;
+	struct io_wq_work_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ctx->work_lock, flags);
+	node = ctx->work_list.first;
+	INIT_WQ_LIST(&ctx->work_list);
+	ctx->work_items = 0;
+	spin_unlock_irqrestore(&ctx->work_lock, flags);
+
+	while (node) {
+		struct io_kiocb *req;
 
-	__io_fallback_tw(node, false);
-	node = llist_del_all(&ctx->retry_llist);
-	__io_fallback_tw(node, false);
+		req = container_of(node, struct io_kiocb, io_task_work.work_node);
+		node = node->next;
+		__io_fallback_tw(req, false, &last_ctx);
+	}
+	if (last_ctx) {
+		flush_delayed_work(&last_ctx->fallback_work);
+		percpu_ref_put(&last_ctx->refs);
+	}
 }
 
 static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
@@ -1272,51 +1261,52 @@ static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
 	return false;
 }
 
-static int __io_run_local_work_loop(struct llist_node **node,
-				    struct io_tw_state *ts,
-				    int events)
-{
-	while (*node) {
-		struct llist_node *next = (*node)->next;
-		struct io_kiocb *req = container_of(*node, struct io_kiocb,
-						    io_task_work.node);
-
-		INDIRECT_CALL_2(req->io_task_work.func,
-				io_poll_task_func, io_req_rw_complete,
-				req, ts);
-		*node = next;
-		if (--events <= 0)
-			break;
-	}
-
-	return events;
-}
-
 static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts,
 			       int min_events)
 {
-	struct llist_node *node;
-	unsigned int loops = 0;
-	int ret, limit;
+	struct io_wq_work_node *node, *tail;
+	unsigned int loops = 1;
+	int ret, limit, nitems;
+	unsigned long flags;
 
 	if (WARN_ON_ONCE(ctx->submitter_task != current))
 		return -EEXIST;
 	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
 		atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
 	limit = max(IO_LOCAL_TW_DEFAULT_MAX, min_events);
+
 again:
-	ret = __io_run_local_work_loop(&ctx->retry_llist.first, ts, limit);
-	if (ctx->retry_llist.first)
+	spin_lock_irqsave(&ctx->work_lock, flags);
+	node = ctx->work_list.first;
+	tail = ctx->work_list.last;
+	nitems = ctx->work_items;
+	INIT_WQ_LIST(&ctx->work_list);
+	ctx->work_items = 0;
+	spin_unlock_irqrestore(&ctx->work_lock, flags);
+
+	while (node) {
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.work_node);
+		node = node->next;
+		INDIRECT_CALL_2(req->io_task_work.func,
+				io_poll_task_func, io_req_rw_complete,
+				req, ts);
+		if (++ret >= limit)
+			break;
+	}
+
+	if (unlikely(node)) {
+		spin_lock_irqsave(&ctx->work_lock, flags);
+		tail->next = ctx->work_list.first;
+		ctx->work_list.first = node;
+		if (!ctx->work_list.last)
+			ctx->work_list.last = tail;
+		ctx->work_items += nitems - ret;
+		spin_unlock_irqrestore(&ctx->work_lock, flags);
 		goto retry_done;
+	}
 
-	/*
-	 * llists are in reverse order, flip it back the right way before
-	 * running the pending items.
-	 */
-	node = llist_reverse_order(llist_del_all(&ctx->work_llist));
-	ret = __io_run_local_work_loop(&node, ts, ret);
-	ctx->retry_llist.first = node;
 	loops++;
-	ret = limit - ret;
 
 	if (io_run_local_work_continue(ctx, ret, min_events))
 		goto again;
@@ -2413,7 +2403,7 @@ static enum hrtimer_restart io_cqring_min_timer_wakeup(struct hrtimer *timer)
 
 	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
 		atomic_set(&ctx->cq_wait_nr, 1);
 		smp_mb();
-		if (!llist_empty(&ctx->work_llist))
+		if (io_local_work_pending(ctx))
 			goto out_wake;
 	}
 
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 214f9f175102..2fae27803116 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -349,7 +349,7 @@ static inline int io_run_task_work(void)
 
 static inline bool io_local_work_pending(struct io_ring_ctx *ctx)
 {
-	return !llist_empty(&ctx->work_llist) || !llist_empty(&ctx->retry_llist);
+	return READ_ONCE(ctx->work_list.first);
 }
 
 static inline bool io_task_work_pending(struct io_ring_ctx *ctx)
-- 
2.45.2

[-- Attachment #3: 0001-io_uring-make-task_work-pending-check-dependent-on-r.patch --]
[-- Type: text/x-patch, Size: 1141 bytes --]

From efb35f94fd1d66690fe31f0ee578479e0787bbf8 Mon Sep 17 00:00:00 2001
From: Jens Axboe <[email protected]>
Date: Thu, 21 Nov 2024 08:27:10 -0700
Subject: [PATCH 1/2] io_uring: make task_work pending check dependent on ring
 type

There's no need to check for generic task_work for DEFER_TASKRUN, if we
have local task_work pending. This avoids dipping into the huge
task_struct, if we have normal task_work pending.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/io_uring.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 12abee607e4a..214f9f175102 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -354,7 +354,9 @@ static inline bool io_local_work_pending(struct io_ring_ctx *ctx)
 
 static inline bool io_task_work_pending(struct io_ring_ctx *ctx)
 {
-	return task_work_pending(current) || io_local_work_pending(ctx);
+	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN && io_local_work_pending(ctx))
+		return true;
+	return task_work_pending(current);
 }
 
 static inline void io_tw_lock(struct io_ring_ctx *ctx, struct io_tw_state *ts)
-- 
2.45.2

^ permalink raw reply related	[flat|nested] 30+ messages in thread
* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 17:05             ` Jens Axboe
@ 2024-11-22 17:01               ` Pavel Begunkov
  2024-11-22 17:08                 ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Pavel Begunkov @ 2024-11-22 17:01 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring

On 11/21/24 17:05, Jens Axboe wrote:
> On 11/21/24 9:57 AM, Jens Axboe wrote:
>> I did run a basic IRQ storage test as-is, and will compare that with the
>> llist stuff we have now. Just in terms of overhead. It's not quite a
>> networking test, but you do get the IRQ side and some burstiness in
>> terms of completions that way too, at high rates. So should be roughly
>> comparable.
>
> Perf looks comparable, it's about 60M IOPS. Some fluctuation with IRQ

60M with iopoll? That one normally shouldn't use task_work

> driven, so won't render an opinion on whether one is faster than the
> other. What is visible though is that adding and running local task_work
> drops from 2.39% to 2.02% using spinlock + io_wq_work_list over llist,

Did you sum it up with io_req_local_work_add()? That just sounds a bit
weird, since it's usually run off [soft]irq. I have doubts that part
became faster. Running could be, especially with a high QD and the
consistency of an SSD. Btw, what QD was it? 32?

> and we entirely drop 2.2% of list reversing in the process.

We actually discussed it before, in some different patchset; perf is
not much help here, the overhead and cache loading move around a lot
between functions.

I don't think we have solid proof here, especially for networking
workloads, which tend to hammer it more and from more CPUs. Can we run
some net benchmarks? Even better to do a good prod experiment.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-22 17:01               ` Pavel Begunkov
@ 2024-11-22 17:08                 ` Jens Axboe
  2024-11-23  0:50                   ` Pavel Begunkov
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2024-11-22 17:08 UTC (permalink / raw)
  To: Pavel Begunkov, David Wei, io-uring

On 11/22/24 10:01 AM, Pavel Begunkov wrote:
> On 11/21/24 17:05, Jens Axboe wrote:
>> On 11/21/24 9:57 AM, Jens Axboe wrote:
>>> I did run a basic IRQ storage test as-is, and will compare that with the
>>> llist stuff we have now. Just in terms of overhead. It's not quite a
>>> networking test, but you do get the IRQ side and some burstiness in
>>> terms of completions that way too, at high rates. So should be roughly
>>> comparable.
>>
>> Perf looks comparable, it's about 60M IOPS. Some fluctuation with IRQ
>
> 60M with iopoll? That one normally shouldn't use task_work

Maybe that wasn't clear, but it's IRQ driven IO. Otherwise indeed
there'd be no task_work in use.

>> driven, so won't render an opinion on whether one is faster than the
>> other. What is visible though is that adding and running local task_work
>> drops from 2.39% to 2.02% using spinlock + io_wq_work_list over llist,
>
> Did you sum it up with io_req_local_work_add()? That just sounds a bit
> weird, since it's usually run off [soft]irq. I have doubts that part
> became faster. Running could be, especially with a high QD and the
> consistency of an SSD. Btw, what QD was it? 32?

It may just trigger more frequently in terms of profiling, since the
list reversal is done. Profiling isn't 100% exact.

>> and we entirely drop 2.2% of list reversing in the process.
>
> We actually discussed it before, in some different patchset; perf is
> not much help here, the overhead and cache loading move around a lot
> between functions.
>
> I don't think we have solid proof here, especially for networking
> workloads, which tend to hammer it more and from more CPUs. Can we run
> some net benchmarks? Even better to do a good prod experiment.

Already in motion. I ran some here and they didn't show any differences
at all, but task_work load was also fairly light. David is running the
networking side and we'll see what it says.

I don't particularly love list + lock for this, but at the end of the
day, the only real downside is the irq disabling nature of it.
Everything else is both simpler, and avoids the really annoying LIFO
nature of llist. I'd expect, all things being equal, that list + lock is
going to be ever so slightly slower. Both will bounce the list
cacheline, no difference in cost on that side. But when you add list
reversal to the mix, that's going to push it to being an overall win.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-22 17:08                 ` Jens Axboe
@ 2024-11-23  0:50                   ` Pavel Begunkov
  0 siblings, 0 replies; 30+ messages in thread
From: Pavel Begunkov @ 2024-11-23  0:50 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring

On 11/22/24 17:08, Jens Axboe wrote:
> On 11/22/24 10:01 AM, Pavel Begunkov wrote:
>> On 11/21/24 17:05, Jens Axboe wrote:
>>> On 11/21/24 9:57 AM, Jens Axboe wrote:
>>>> I did run a basic IRQ storage test as-is, and will compare that with the
>>>> llist stuff we have now. Just in terms of overhead. It's not quite a
>>>> networking test, but you do get the IRQ side and some burstiness in
>>>> terms of completions that way too, at high rates. So should be roughly
>>>> comparable.
>>>
>>> Perf looks comparable, it's about 60M IOPS. Some fluctuation with IRQ
>>
>> 60M with iopoll? That one normally shouldn't use task_work
>
> Maybe that wasn't clear, but it's IRQ driven IO. Otherwise indeed
> there'd be no task_work in use.
>
>>> driven, so won't render an opinion on whether one is faster than the
>>> other. What is visible though is that adding and running local task_work
>>> drops from 2.39% to 2.02% using spinlock + io_wq_work_list over llist,
>>
>> Did you sum it up with io_req_local_work_add()? That just sounds a bit
>> weird, since it's usually run off [soft]irq. I have doubts that part
>> became faster. Running could be, especially with a high QD and the
>> consistency of an SSD. Btw, what QD was it? 32?

Why I asked about QD is because storage tests reliably give you a list
of QD task_work items: the longer the list, the more expensive the
reversal, with washing out of cache lines. For QD=32 it's a 32-entry
list reversal, so I'd understand if you're seeing some perf improvement
there. With QD=1 it would be the opposite.

With David's thing it's similar: he gets a long list because of
wait-based batching. Users who don't do that might get worse
performance (which might be fine).

> It may just trigger more frequently in terms of profiling, since the
> list reversal is done. Profiling isn't 100% exact.
>
>>> and we entirely drop 2.2% of list reversing in the process.
>>
>> We actually discussed it before, in some different patchset; perf is
>> not much help here, the overhead and cache loading move around a lot
>> between functions.
>>
>> I don't think we have solid proof here, especially for networking
>> workloads, which tend to hammer it more and from more CPUs. Can we run
>> some net benchmarks? Even better to do a good prod experiment.
>
> Already in motion. I ran some here and they didn't show any differences
> at all, but task_work load was also fairly light. David is running the
> networking side and we'll see what it says.

That's great; if it survives high-traffic prod, there should be less
need to worry about it regressing. The eerie part is that we keep
switching it back and forth, rediscovering the same problems. Even the
reordering issue was mentioned and warned about before the wait-free
list got merged, but successfully ignored until we got latency issues.
And now we're back full circle. Would be nice to find some peace (or
something inarguably better).

> I don't particularly love list + lock for this, but at the end of the
> day, the only real downside is the irq disabling nature of it.
> Everything else is both simpler, and avoids the really annoying LIFO
> nature of llist. I'd expect, all things being equal, that list + lock is
> going to be ever so slightly slower. Both will bounce the list
> cacheline, no difference in cost on that side. But when you add list
> reversal to the mix, that's going to push it to being an overall win.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 15:07     ` Pavel Begunkov
  2024-11-21 15:15       ` Jens Axboe
@ 2024-11-21 17:53       ` David Wei
  2024-11-22 15:57         ` Pavel Begunkov
  1 sibling, 1 reply; 30+ messages in thread
From: David Wei @ 2024-11-21 17:53 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe, io-uring

On 2024-11-21 07:07, Pavel Begunkov wrote:
> On 11/21/24 14:31, Jens Axboe wrote:
>> On 11/21/24 7:25 AM, Pavel Begunkov wrote:
>>> On 11/21/24 01:12, Jens Axboe wrote:
>>>> On 11/20/24 4:56 PM, Pavel Begunkov wrote:
>>>>> On 11/20/24 22:14, David Wei wrote:
> ...
>>>> I think that can only work if we change work_llist to be a regular list
>>>> with regular locking. Otherwise it's a bit of a mess with the list being
>>>
>>> Dylan once measured the overhead of locks vs atomics in this
>>> path for some artificial case, we can pull the numbers up.
>>
>> I did it more recently if you'll remember, actually posted a patch I
>> think a few months ago changing it to that. But even that approach adds
>
> Right, and it'd be a separate topic from this set.
>
>> extra overhead, if you want to add it to the same list as now you need
>
> Extra overhead to the retry path, which is not the hot path,
> and the coldness of it is uncertain.
>
>> to re-grab (and re-disable interrupts) the lock to add it back. My gut
>> says that would be _worse_ than the current approach. And if you keep a
>> separate list instead, well then you're back to identical overhead in
>> terms of now needing to check both when needing to know if anything is
>> pending, and checking both when running it.
>>
>>>> reordered, and then you're spending extra cycles on potentially
>>>> reordering all the entries again.
>>>
>>> That sucks, I agree, but then it's the same question of how often
>>> it happens.
>>
>> At least for now, there's a real issue reported and we should fix it. I
>> think the current patches are fine in that regard. That doesn't mean we
>> can't potentially make it better, we should certainly investigate that.
>> But I don't see the current patches as being suboptimal really, they are
>> definitely good enough as-is for solving the issue.
>
> That's fair enough, but I still would love to know how frequent
> it is. There is no purpose in optimising it as hot/slow path if
> it triggers every fifth run or such. David, how easy is it to
> get some stats? We can hack up some bpftrace script
>

Here is a sample distribution of how much task_work is done per
__io_run_local_work():

@work_done:
[1]         15385954 |@                                                   |
[2, 4)      33424809 |@@@@                                                |
[4, 8)     196055270 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
[8, 16)    419060191 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)    48395043 |@@@@@@                                              |
[32, 64)     1573469 |                                                    |
[64, 128)      98151 |                                                    |
[128, 256)     14288 |                                                    |
[256, 512)      2035 |                                                    |
[512, 1K)        268 |                                                    |
[1K, 2K)          13 |                                                    |

This workload had wait_nr set to 20 and the timeout set to 500 µs.

Empirically, I know that any batch of task_work > 50 will violate the
latency limit for this workload. In those cases, all the requests must
be dropped. So even if excessive task_work happens only a small % of
the time, the impact is far larger than that % suggests.

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [PATCH next v1 2/2] io_uring: limit local tw done
  2024-11-21 17:53       ` David Wei
@ 2024-11-22 15:57         ` Pavel Begunkov
  0 siblings, 0 replies; 30+ messages in thread
From: Pavel Begunkov @ 2024-11-22 15:57 UTC (permalink / raw)
  To: David Wei, Jens Axboe, io-uring

On 11/21/24 17:53, David Wei wrote:
> On 2024-11-21 07:07, Pavel Begunkov wrote:
>> On 11/21/24 14:31, Jens Axboe wrote:
>>> On 11/21/24 7:25 AM, Pavel Begunkov wrote:
>>>> On 11/21/24 01:12, Jens Axboe wrote:
>>>>> On 11/20/24 4:56 PM, Pavel Begunkov wrote:
>>>>>> On 11/20/24 22:14, David Wei wrote:
>> ...
>>>>> I think that can only work if we change work_llist to be a regular list
>>>>> with regular locking. Otherwise it's a bit of a mess with the list being
>>>>
>>>> Dylan once measured the overhead of locks vs atomics in this
>>>> path for some artificial case, we can pull the numbers up.
>>>
>>> I did it more recently if you'll remember, actually posted a patch I
>>> think a few months ago changing it to that. But even that approach adds
>>
>> Right, and it'd be a separate topic from this set.
>>
>>> extra overhead, if you want to add it to the same list as now you need
>>
>> Extra overhead to the retry path, which is not the hot path,
>> and the coldness of it is uncertain.
>>
>>> to re-grab (and re-disable interrupts) the lock to add it back. My gut
>>> says that would be _worse_ than the current approach. And if you keep a
>>> separate list instead, well then you're back to identical overhead in
>>> terms of now needing to check both when needing to know if anything is
>>> pending, and checking both when running it.
>>>
>>>>> reordered, and then you're spending extra cycles on potentially
>>>>> reordering all the entries again.
>>>>
>>>> That sucks, I agree, but then it's the same question of how often
>>>> it happens.
>>>
>>> At least for now, there's a real issue reported and we should fix it. I
>>> think the current patches are fine in that regard. That doesn't mean we
>>> can't potentially make it better, we should certainly investigate that.
>>> But I don't see the current patches as being suboptimal really, they are
>>> definitely good enough as-is for solving the issue.
>>
>> That's fair enough, but I still would love to know how frequent
>> it is. There is no purpose in optimising it as hot/slow path if
>> it triggers every fifth run or such. David, how easy is it to
>> get some stats? We can hack up some bpftrace script
>>
>
> Here is a sample distribution of how much task_work is done per
> __io_run_local_work():
>
> @work_done:
> [1]         15385954 |@                                                   |
> [2, 4)      33424809 |@@@@                                                |
> [4, 8)     196055270 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
> [8, 16)    419060191 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16, 32)    48395043 |@@@@@@                                              |
> [32, 64)     1573469 |                                                    |
> [64, 128)      98151 |                                                    |
> [128, 256)     14288 |                                                    |
> [256, 512)      2035 |                                                    |
> [512, 1K)        268 |                                                    |
> [1K, 2K)          13 |                                                    |

Nice

> This workload had wait_nr set to 20 and the timeout set to 500 µs.
>
> Empirically, I know that any batch of task_work > 50 will violate the
> latency limit for this workload. In those cases, all the requests must
> be dropped. So even if excessive task_work happens only a small % of
> the time, the impact is far larger than that % suggests.

So you've got a long tail, which spikes your nines; that makes sense.
On the other hand it's perhaps 5-10% of the total, though that's hard
to judge as the [16, 32) bucket is split by the constant 20. My guess
would be that a small optimisation for the normal case, adding a bit
more to the requeue path, may well be worth it, but that depends on
how sharp the skew within that bucket is.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [PATCH next v1 0/2] limit local tw done
  2024-11-20 22:14 [PATCH next v1 0/2] limit local tw done David Wei
  2024-11-20 22:14 ` [PATCH next v1 1/2] io_uring: add io_local_work_pending() David Wei
  2024-11-20 22:14 ` [PATCH next v1 2/2] io_uring: limit local tw done David Wei
@ 2024-11-21  1:12 ` Jens Axboe
  2024-11-21 14:16 ` Jens Axboe
  3 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2024-11-21  1:12 UTC (permalink / raw)
  To: David Wei, io-uring; +Cc: Pavel Begunkov

On 11/20/24 3:14 PM, David Wei wrote:
> Currently when running local tw it will eagerly try and complete all
> available work. When there is a burst of work coming in, the list of
> work in work_llist may be much larger than the requested batch count
> wait_nr. Doing all of the work eagerly may cause latency issues for some
> applications that do not want to spend so much time in the kernel.
>
> Limit the amount of local tw done to the max of 20 (a heuristic) or
> wait_nr. This then does not require any userspace changes.
>
> Many applications will submit and wait with wait_nr = 1 to mean "wait
> for _at least_ 1 completion, but do more work if available". This used
> to mean "all work" but now that is capped to 20 requests. If users want
> more work batching, then they can set wait_nr to > 20.

Looks good to me!

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [PATCH next v1 0/2] limit local tw done
  2024-11-20 22:14 [PATCH next v1 0/2] limit local tw done David Wei
                   ` (2 preceding siblings ...)
  2024-11-21  1:12 ` [PATCH next v1 0/2] " Jens Axboe
@ 2024-11-21 14:16 ` Jens Axboe
  3 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2024-11-21 14:16 UTC (permalink / raw)
  To: io-uring, David Wei; +Cc: Pavel Begunkov

On Wed, 20 Nov 2024 14:14:50 -0800, David Wei wrote:
> Currently when running local tw it will eagerly try and complete all
> available work. When there is a burst of work coming in, the list of
> work in work_llist may be much larger than the requested batch count
> wait_nr. Doing all of the work eagerly may cause latency issues for some
> applications that do not want to spend so much time in the kernel.
>
> Limit the amount of local tw done to the max of 20 (a heuristic) or
> wait_nr. This then does not require any userspace changes.
>
> [...]

Applied, thanks!

[1/2] io_uring: add io_local_work_pending()
      commit: 40cfe553240b32333b42652370ef5232e6ac59e1
[2/2] io_uring: limit local tw done
      commit: f46b9cdb22f7a167c36b6bcddaef7e8aee2598fa

Best regards,
-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 30+ messages in thread
end of thread, other threads:[~2024-11-23  0:49 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-20 22:14 [PATCH next v1 0/2] limit local tw done David Wei
2024-11-20 22:14 ` [PATCH next v1 1/2] io_uring: add io_local_work_pending() David Wei
2024-11-20 23:45 ` Pavel Begunkov
2024-11-20 22:14 ` [PATCH next v1 2/2] io_uring: limit local tw done David Wei
2024-11-20 23:56 ` Pavel Begunkov
2024-11-21  0:52 ` David Wei
2024-11-21 14:29 ` Pavel Begunkov
2024-11-21 14:34 ` Jens Axboe
2024-11-21 14:58 ` Pavel Begunkov
2024-11-21 15:02 ` Jens Axboe
2024-11-21  1:12 ` Jens Axboe
2024-11-21 14:25 ` Pavel Begunkov
2024-11-21 14:31 ` Jens Axboe
2024-11-21 15:07 ` Pavel Begunkov
2024-11-21 15:15 ` Jens Axboe
2024-11-21 15:22 ` Jens Axboe
2024-11-21 16:00 ` Pavel Begunkov
2024-11-21 16:05 ` Jens Axboe
2024-11-21 16:18 ` Pavel Begunkov
2024-11-21 16:20 ` Jens Axboe
2024-11-21 16:43 ` Pavel Begunkov
2024-11-21 16:57 ` Jens Axboe
2024-11-21 17:05 ` Jens Axboe
2024-11-22 17:01 ` Pavel Begunkov
2024-11-22 17:08 ` Jens Axboe
2024-11-23  0:50 ` Pavel Begunkov
2024-11-21 17:53 ` David Wei
2024-11-22 15:57 ` Pavel Begunkov
2024-11-21  1:12 ` [PATCH next v1 0/2] " Jens Axboe
2024-11-21 14:16 ` Jens Axboe
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox