From: Jens Axboe <[email protected]>
To: Pavel Begunkov <[email protected]>, [email protected]
Subject: Re: [PATCH 2/4] io_uring: switch deferred task_work to an io_wq_work_list
Date: Wed, 27 Mar 2024 09:45:40 -0600 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On 3/27/24 7:24 AM, Pavel Begunkov wrote:
> On 3/26/24 18:42, Jens Axboe wrote:
>> Lockless lists may be handy for some things, but they mean that items
>> are in the reverse order as we can only add to the head of the list.
>> That in turn means that iterating items on the list needs to reverse it
>> first, if it's sensitive to ordering between items on the list.
>>
>> Switch the DEFER_TASKRUN work list from an llist to a normal
>> io_wq_work_list, and protect it with a lock. Then we can get rid of the
>> manual reversing of the list when running it, which takes considerable
>> cycles particularly for bursty task_work additions.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>> ---
>> include/linux/io_uring_types.h | 11 ++--
>> io_uring/io_uring.c | 117 ++++++++++++---------------------
>> io_uring/io_uring.h | 4 +-
>> 3 files changed, 51 insertions(+), 81 deletions(-)
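As a bit of extra context for the diff below, here is a minimal sketch of the
difference being described (simplified names, not the actual io_uring code):
with an llist the consumer has to reverse the list before running it, while a
locked io_wq_work_list is appended at the tail and can be walked in order.

        /* Sketch only: process() is a hypothetical handler stand-in. */

        /*
         * llist variant: producers push to the head with llist_add(), so the
         * consumer receives items newest-first and must reverse the whole
         * chain before it can run them in submission order.
         */
        static void consume_llist(struct llist_head *list)
        {
                struct llist_node *node = llist_del_all(list);

                node = llist_reverse_order(node);       /* extra O(n) pass just for ordering */
                while (node) {
                        struct llist_node *next = node->next;

                        process(node);                  /* hypothetical handler */
                        node = next;
                }
        }

        /*
         * locked list variant: producers append at the tail under a spinlock,
         * so the consumer can splice the list and walk it front to back with
         * no reversal step.
         */
        static void produce_locked(struct io_ring_ctx *ctx, struct io_kiocb *req)
        {
                unsigned long flags;

                spin_lock_irqsave(&ctx->work_lock, flags);
                wq_list_add_tail(&req->io_task_work.node, &ctx->work_list);
                spin_unlock_irqrestore(&ctx->work_lock, flags);
        }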
>>
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index aeb4639785b5..e51bf15196e4 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -329,7 +329,9 @@ struct io_ring_ctx {
>> * regularly bounce b/w CPUs.
>
> ...
>
>> -static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>> +static inline void io_req_local_work_add(struct io_kiocb *req, unsigned tw_flags)
>> {
>> struct io_ring_ctx *ctx = req->ctx;
>> - unsigned nr_wait, nr_tw, nr_tw_prev;
>> - struct llist_node *head;
>> + unsigned nr_wait, nr_tw;
>> + unsigned long flags;
>> /* See comment above IO_CQ_WAKE_INIT */
>> BUILD_BUG_ON(IO_CQ_WAKE_FORCE <= IORING_MAX_CQ_ENTRIES);
>> /*
>> - * We don't know how many reuqests is there in the link and whether
>> + * We don't know how many requests is there in the link and whether
>> * they can even be queued lazily, fall back to non-lazy.
>> */
>> if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK))
>> - flags &= ~IOU_F_TWQ_LAZY_WAKE;
>> -
>> - head = READ_ONCE(ctx->work_llist.first);
>> - do {
>> - nr_tw_prev = 0;
>> - if (head) {
>> - struct io_kiocb *first_req = container_of(head,
>> - struct io_kiocb,
>> - io_task_work.node);
>> - /*
>> - * Might be executed at any moment, rely on
>> - * SLAB_TYPESAFE_BY_RCU to keep it alive.
>> - */
>> - nr_tw_prev = READ_ONCE(first_req->nr_tw);
>> - }
>> -
>> - /*
>> - * Theoretically, it can overflow, but that's fine as one of
>> - * previous adds should've tried to wake the task.
>> - */
>> - nr_tw = nr_tw_prev + 1;
>> - if (!(flags & IOU_F_TWQ_LAZY_WAKE))
>> - nr_tw = IO_CQ_WAKE_FORCE;
>
> Aren't you just killing the entire IOU_F_TWQ_LAZY_WAKE handling?
> It's assigned to IO_CQ_WAKE_FORCE so that it passes the check
> before wake_up below.
Yeah, I messed that one up; I did fix that one yesterday before sending it
out.
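For reference, roughly what the respin needs to keep (a sketch of the intent,
not the actual resent patch): the non-lazy case still has to force the wake by
bumping nr_tw past any possible waiter threshold, e.g.:

        spin_lock_irqsave(&ctx->work_lock, flags);
        wq_list_add_tail(&req->io_task_work.node, &ctx->work_list);
        nr_tw = ++ctx->work_items;
        spin_unlock_irqrestore(&ctx->work_lock, flags);

        /*
         * Sketch only: IO_CQ_WAKE_FORCE is larger than any valid waiter
         * count, so a non-lazy add always passes the later
         * nr_tw >= nr_wait check and wakes the task.
         */
        if (!(tw_flags & IOU_F_TWQ_LAZY_WAKE))
                nr_tw = IO_CQ_WAKE_FORCE;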
>> + tw_flags &= ~IOU_F_TWQ_LAZY_WAKE;
>> - req->nr_tw = nr_tw;
>> - req->io_task_work.node.next = head;
>> - } while (!try_cmpxchg(&ctx->work_llist.first, &head,
>> - &req->io_task_work.node));
>> + spin_lock_irqsave(&ctx->work_lock, flags);
>> + wq_list_add_tail(&req->io_task_work.node, &ctx->work_list);
>> + nr_tw = ++ctx->work_items;
>> + spin_unlock_irqrestore(&ctx->work_lock, flags);
>
> smp_mb(), see the comment below, and fwiw "_after_atomic" would not
> work.
For this one, I think all we need is for the wq_list_empty()
check to be fully stable. If we read:
nr_wait = atomic_read(&ctx->cq_wait_nr);
right before a waiter does:
atomic_set(&ctx->cq_wait_nr, foo);
set_current_state(TASK_INTERRUPTIBLE);
then we need to ensure that the "I have work" check in
io_cqring_wait_schedule() sees the work. The spin_unlock() has release
semantics, and the current READ_ONCE() for the work check should be
enough, no?
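To make the ordering question concrete, the two sequences look roughly like
this (heavily simplified; run_work is just a stand-in label):

        /* producer side, io_req_local_work_add(), simplified: */
        spin_lock_irqsave(&ctx->work_lock, flags);
        wq_list_add_tail(&req->io_task_work.node, &ctx->work_list);    /* A: publish work */
        spin_unlock_irqrestore(&ctx->work_lock, flags);                /* release */
        ...
        nr_wait = atomic_read(&ctx->cq_wait_nr);                       /* B: read waiter count */
        if (nr_tw >= nr_wait)
                wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);

        /* waiter side, io_cqring_wait()/io_cqring_wait_schedule(), simplified: */
        atomic_set(&ctx->cq_wait_nr, wait_nr);                         /* C: announce waiting */
        set_current_state(TASK_INTERRUPTIBLE);
        if (!wq_list_empty(&ctx->work_list))                           /* D: recheck for work */
                goto run_work;                                         /* stand-in label */
        schedule();

The question is whether the store at A is guaranteed to be visible to the
recheck at D without a full barrier between the unlock and the load at B,
given that spin_unlock() only provides release semantics.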
>> /*
>> * cmpxchg implies a full barrier, which pairs with the barrier
>> @@ -1289,7 +1254,7 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>> * is similar to the wait/wawke task state sync.
>> */
>> - if (!head) {
>> + if (nr_tw == 1) {
>> if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
>> atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
>> if (ctx->has_evfd)
>> @@ -1297,13 +1262,8 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>> }
>> nr_wait = atomic_read(&ctx->cq_wait_nr);
>> - /* not enough or no one is waiting */
>> - if (nr_tw < nr_wait)
>> - return;
>> - /* the previous add has already woken it up */
>> - if (nr_tw_prev >= nr_wait)
>> - return;
>> - wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
>> + if (nr_tw >= nr_wait)
>> + wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
>
> IIUC, you're removing a very important optimisation, and I
> don't understand why'd you do that. It was waking up only when
> it's changing from "don't need to wake" to "have to be woken up",
> just once per splicing the list on the waiting side.
>
> It seems like this one will keep pounding with wake_up_state for
> every single tw queued after @nr_wait, which quite often just 1.
Agreed, that was just a silly oversight; I brought that back now.
Apparently it doesn't hit anything here, at least not to the extent that
I saw it in testing. But it is a good optimization and we should keep
it, so that only the first add over the threshold attempts the wake.
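Roughly, with the plain counter the previous value is just nr_tw - 1, so the
check can keep the same shape as before (a sketch only, leaving the
IO_CQ_WAKE_FORCE path out):

        nr_wait = atomic_read(&ctx->cq_wait_nr);
        /* not enough queued yet, or no one is waiting */
        if (nr_tw < nr_wait)
                return;
        /* an earlier add already crossed the threshold and woke the task */
        if (nr_tw - 1 >= nr_wait)
                return;
        wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);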
--
Jens Axboe
Thread overview:
2024-03-26 18:42 [PATCHSET 0/4] Use io_wq_work_list for task_work Jens Axboe
2024-03-26 18:42 ` [PATCH 1/4] io_uring: use the right type for work_llist empty check Jens Axboe
2024-03-26 18:42 ` [PATCH 2/4] io_uring: switch deferred task_work to an io_wq_work_list Jens Axboe
2024-03-27 13:24 ` Pavel Begunkov
2024-03-27 15:45 ` Jens Axboe [this message]
2024-03-27 16:37 ` Jens Axboe
2024-03-27 17:28 ` Pavel Begunkov
2024-03-27 17:34 ` Jens Axboe
2024-03-26 18:42 ` [PATCH 3/4] io_uring: switch fallback work to io_wq_work_list Jens Axboe
2024-03-26 18:42 ` [PATCH 4/4] io_uring: switch normal task_work " Jens Axboe
2024-03-27 13:33 ` [PATCHSET 0/4] Use io_wq_work_list for task_work Pavel Begunkov
2024-03-27 16:36 ` Jens Axboe
2024-03-27 17:05 ` Jens Axboe
2024-03-27 18:04 ` Pavel Begunkov