From: David Wei <[email protected]>
To: Pavel Begunkov <[email protected]>, [email protected]
Cc: Jens Axboe <[email protected]>
Subject: Re: [PATCH next v1 2/2] io_uring: limit local tw done
Date: Wed, 20 Nov 2024 16:52:13 -0800
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>

On 2024-11-20 15:56, Pavel Begunkov wrote:
> On 11/20/24 22:14, David Wei wrote:
>> Instead of eagerly running all available local tw, limit the amount of
>> local tw done to the max of IO_LOCAL_TW_DEFAULT_MAX (20) or wait_nr. The
>> value of 20 is chosen as a reasonable heuristic to allow enough work
>> batching but also keep latency down.
>>
>> Add a retry_llist that maintains a list of local tw that couldn't be
>> done in time. No synchronisation is needed since it is only modified
>> within the task context.
>>
>> Signed-off-by: David Wei <[email protected]>
>> ---
>> include/linux/io_uring_types.h | 1 +
>> io_uring/io_uring.c | 43 +++++++++++++++++++++++++---------
>> io_uring/io_uring.h | 2 +-
>> 3 files changed, 34 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index 593c10a02144..011860ade268 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -336,6 +336,7 @@ struct io_ring_ctx {
>> */
>> struct {
>> struct llist_head work_llist;
>> + struct llist_head retry_llist;
>
> Fwiw, probably doesn't matter, but it doesn't even need
> to be atomic, it's queued and spliced while holding
> ->uring_lock, the pending check is also synchronised as
> there is only one possible task doing that.

Yeah, it doesn't. Keeping it as an llist_head means being able to
re-use the helpers that take an llist_head or llist_node.
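
For instance, io_local_work_pending() from patch 1/2 can check both
lists with the same llist helper (rough sketch, not verbatim from the
series):

static inline bool io_local_work_pending(struct io_ring_ctx *ctx)
{
	/* plain loads; no atomic ops needed for the retry list */
	return !llist_empty(&ctx->work_llist) ||
	       !llist_empty(&ctx->retry_llist);
}
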
>
>> unsigned long check_cq;
>> atomic_t cq_wait_nr;
>> atomic_t cq_timeouts;
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 83bf041d2648..c3a7d0197636 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -121,6 +121,7 @@
> ...
>> static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts,
>> int min_events)
>> {
>> struct llist_node *node;
>> unsigned int loops = 0;
>> - int ret = 0;
>> + int ret, limit;
>> if (WARN_ON_ONCE(ctx->submitter_task != current))
>> return -EEXIST;
>> if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
>> atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
>> + limit = max(IO_LOCAL_TW_DEFAULT_MAX, min_events);
>> again:
>> + ret = __io_run_local_work_loop(&ctx->retry_llist.first, ts, limit);
>> + if (ctx->retry_llist.first)
>> + goto retry_done;
>> +
>> /*
>> * llists are in reverse order, flip it back the right way before
>> * running the pending items.
>> */
>> node = llist_reverse_order(llist_del_all(&ctx->work_llist));
>> - while (node) {
>> - struct llist_node *next = node->next;
>> - struct io_kiocb *req = container_of(node, struct io_kiocb,
>> - io_task_work.node);
>> - INDIRECT_CALL_2(req->io_task_work.func,
>> - io_poll_task_func, io_req_rw_complete,
>> - req, ts);
>> - ret++;
>> - node = next;
>> - }
>> + ret = __io_run_local_work_loop(&node, ts, ret);
>
> One thing that is not so nice is that now we have this handling and
> checks in the hot path, and __io_run_local_work_loop() most likely
> gets uninlined.
>
> I wonder, can we just requeue it via task_work again? We can even
> add a variant efficiently adding a list instead of a single entry,
> i.e. local_task_work_add(head, tail, ...);

That was an early idea, but it means re-reversing the list and then
atomically adding each node back to work_llist, concurrently with e.g.
io_req_local_work_add().

Using a separate retry_llist means we never need to concurrently add to
either retry_llist or work_llist.
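
For reference, a batched requeue would look roughly like this (sketch
only; llist_add_batch() is the existing helper, first/last are made-up
names for the leftover sublist):

	/* requeue leftovers [first, last] onto work_llist; this races
	 * with io_req_local_work_add(), so the cmpxchg loop inside
	 * llist_add_batch() is unavoidable */
	llist_add_batch(first, last, &ctx->work_llist);

	/* vs. parking them on retry_llist, which only this task
	 * touches, so a plain store is enough */
	ctx->retry_llist.first = first;
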
>
> I'm also curious what's the use case you've got that is hitting
> the problem?
>

There is a Memcache-like workload that sheds load based on the time
spent doing work. With epoll, the work of reading sockets and
processing a request is done by the user, who can decide after some
amount of time to drop the remaining work if it is taking too long.
With io_uring, the work of reading sockets is done eagerly inside of
task work. If there is a burst of work, then so much time is spent in
task work reading from sockets that, by the time control returns to the
user, the timeout has already elapsed.
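
Roughly, the epoll version of the server gets to do something like this
(illustrative only; handle_request(), elapsed_us() and BUDGET_US are
made-up names):

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int i = 0; i < nevents; i++) {
		handle_request(&events[i]);	/* read socket + process */
		clock_gettime(CLOCK_MONOTONIC, &now);
		if (elapsed_us(&start, &now) > BUDGET_US)
			break;	/* shed the rest of this batch */
	}

With io_uring the socket reads happen inside task work before control
returns, so there is no equivalent point at which to cut the batch
short.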