From: Jens Axboe <[email protected]>
To: Pavel Begunkov <[email protected]>, [email protected]
Subject: Re: [PATCH 2/4] io_uring: switch deferred task_work to an io_wq_work_list
Date: Wed, 27 Mar 2024 09:45:40 -0600
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>

On 3/27/24 7:24 AM, Pavel Begunkov wrote:
> On 3/26/24 18:42, Jens Axboe wrote:
>> Lockless lists may be handy for some things, but they mean that items
>> are in the reverse order as we can only add to the head of the list.
>> That in turn means that iterating items on the list needs to reverse it
>> first, if it's sensitive to ordering between items on the list.
>>
>> Switch the DEFER_TASKRUN work list from an llist to a normal
>> io_wq_work_list, and protect it with a lock. Then we can get rid of the
>> manual reversing of the list when running it, which takes considerable
>> cycles particularly for bursty task_work additions.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>> ---
>>   include/linux/io_uring_types.h |  11 ++--
>>   io_uring/io_uring.c            | 117 ++++++++++++---------------------
>>   io_uring/io_uring.h            |   4 +-
>>   3 files changed, 51 insertions(+), 81 deletions(-)
>>
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index aeb4639785b5..e51bf15196e4 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -329,7 +329,9 @@ struct io_ring_ctx {
>>        * regularly bounce b/w CPUs.
> 
> ...
> 
>> -static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>> +static inline void io_req_local_work_add(struct io_kiocb *req, unsigned tw_flags)
>>   {
>>       struct io_ring_ctx *ctx = req->ctx;
>> -    unsigned nr_wait, nr_tw, nr_tw_prev;
>> -    struct llist_node *head;
>> +    unsigned nr_wait, nr_tw;
>> +    unsigned long flags;
>>
>>       /* See comment above IO_CQ_WAKE_INIT */
>>       BUILD_BUG_ON(IO_CQ_WAKE_FORCE <= IORING_MAX_CQ_ENTRIES);
>>
>>       /*
>> -     * We don't know how many reuqests is there in the link and whether
>> +     * We don't know how many requests is there in the link and whether
>>        * they can even be queued lazily, fall back to non-lazy.
>>        */
>>       if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK))
>> -        flags &= ~IOU_F_TWQ_LAZY_WAKE;
>> -
>> -    head = READ_ONCE(ctx->work_llist.first);
>> -    do {
>> -        nr_tw_prev = 0;
>> -        if (head) {
>> -            struct io_kiocb *first_req = container_of(head,
>> -                            struct io_kiocb,
>> -                            io_task_work.node);
>> -            /*
>> -             * Might be executed at any moment, rely on
>> -             * SLAB_TYPESAFE_BY_RCU to keep it alive.
>> -             */
>> -            nr_tw_prev = READ_ONCE(first_req->nr_tw);
>> -        }
>> -
>> -        /*
>> -         * Theoretically, it can overflow, but that's fine as one of
>> -         * previous adds should've tried to wake the task.
>> -         */
>> -        nr_tw = nr_tw_prev + 1;
>> -        if (!(flags & IOU_F_TWQ_LAZY_WAKE))
>> -            nr_tw = IO_CQ_WAKE_FORCE;
> 
> Aren't you just killing the entire IOU_F_TWQ_LAZY_WAKE handling?
> It's assigned to IO_CQ_WAKE_FORCE so that it passes the check
> before wake_up below.

Yeah, I messed that one up; I did fix that one yesterday before sending
it out.
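
To spell out what the non-lazy path needs here: it still has to override
the count so it always passes the waiter check further down, something
along these lines (a sketch, not necessarily the exact v2 code):

        spin_lock_irqsave(&ctx->work_lock, flags);
        wq_list_add_tail(&req->io_task_work.node, &ctx->work_list);
        nr_tw = ++ctx->work_items;
        spin_unlock_irqrestore(&ctx->work_lock, flags);

        /* non-lazy adds must always be able to pass the nr_tw >= nr_wait check */
        if (!(tw_flags & IOU_F_TWQ_LAZY_WAKE))
                nr_tw = IO_CQ_WAKE_FORCE;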

>> +        tw_flags &= ~IOU_F_TWQ_LAZY_WAKE;
>>
>> -        req->nr_tw = nr_tw;
>> -        req->io_task_work.node.next = head;
>> -    } while (!try_cmpxchg(&ctx->work_llist.first, &head,
>> -                  &req->io_task_work.node));
>> +    spin_lock_irqsave(&ctx->work_lock, flags);
>> +    wq_list_add_tail(&req->io_task_work.node, &ctx->work_list);
>> +    nr_tw = ++ctx->work_items;
>> +    spin_unlock_irqrestore(&ctx->work_lock, flags);
> 
> smp_mb(), see the comment below, and fwiw "_after_atomic" would not
> work.

For this one, I think all we need to do is have the wq_list_empty()
check be fully stable. If we read:

nr_wait = atomic_read(&ctx->cq_wait_nr);

right before a waiter does:

atomic_set(&ctx->cq_wait_nr, foo);
set_current_state(TASK_INTERRUPTIBLE);

then we need to ensure that the "I have work" check in
io_cqring_wait_schedule() sees the work. The spin_unlock() has release
semantics, and the current READ_ONCE() for the work check should be enough,
no?
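
Laying out the two sides I have in mind, just so we're talking about the
same window (a sketch of the relevant bits, not the exact code):

        /* task_work add side */
        spin_lock_irqsave(&ctx->work_lock, flags);
        wq_list_add_tail(&req->io_task_work.node, &ctx->work_list);
        nr_tw = ++ctx->work_items;
        spin_unlock_irqrestore(&ctx->work_lock, flags); /* release */

        nr_wait = atomic_read(&ctx->cq_wait_nr);
        if (nr_tw >= nr_wait)
                wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);

        /* waiter side, before sleeping */
        atomic_set(&ctx->cq_wait_nr, foo);
        set_current_state(TASK_INTERRUPTIBLE);
        /* the "I have work" check in io_cqring_wait_schedule() */
        if (wq_list_empty(&ctx->work_list))
                schedule();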

>>       /*
>>        * cmpxchg implies a full barrier, which pairs with the barrier
>> @@ -1289,7 +1254,7 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>>        * is similar to the wait/wawke task state sync.
>>        */
>>
>> -    if (!head) {
>> +    if (nr_tw == 1) {
>>           if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
>>               atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
>>           if (ctx->has_evfd)
>> @@ -1297,13 +1262,8 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>>       }
>>
>>       nr_wait = atomic_read(&ctx->cq_wait_nr);
>> -    /* not enough or no one is waiting */
>> -    if (nr_tw < nr_wait)
>> -        return;
>> -    /* the previous add has already woken it up */
>> -    if (nr_tw_prev >= nr_wait)
>> -        return;
>> -    wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
>> +    if (nr_tw >= nr_wait)
>> +        wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
> 
> IIUC, you're removing a very important optimisation, and I
> don't understand why'd you do that. It was waking up only when
> it's changing from "don't need to wake" to "have to be woken up",
> just once per splicing the list on the waiting side.
> 
> It seems like this one will keep pounding with wake_up_state for
> every single tw queued after @nr_wait, which quite often just 1.

Agree, that was just a silly oversight, and I brought that back now.
Apparently it doesn't hit anything here, at least not to an extent that
I saw in testing. But it is a good idea and we should keep it, so only
the first add over the threshold attempts the wake.
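
Roughly what I mean, ignoring the IO_CQ_WAKE_FORCE case (a sketch,
untested):

        nr_wait = atomic_read(&ctx->cq_wait_nr);
        /* not enough, or nobody is waiting */
        if (nr_tw < nr_wait)
                return;
        /* an earlier add crossed the threshold and should have done the wake */
        if (nr_tw > nr_wait)
                return;
        wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);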

-- 
Jens Axboe


Thread overview: 14+ messages
2024-03-26 18:42 [PATCHSET 0/4] Use io_wq_work_list for task_work Jens Axboe
2024-03-26 18:42 ` [PATCH 1/4] io_uring: use the right type for work_llist empty check Jens Axboe
2024-03-26 18:42 ` [PATCH 2/4] io_uring: switch deferred task_work to an io_wq_work_list Jens Axboe
2024-03-27 13:24   ` Pavel Begunkov
2024-03-27 15:45     ` Jens Axboe [this message]
2024-03-27 16:37       ` Jens Axboe
2024-03-27 17:28         ` Pavel Begunkov
2024-03-27 17:34           ` Jens Axboe
2024-03-26 18:42 ` [PATCH 3/4] io_uring: switch fallback work to io_wq_work_list Jens Axboe
2024-03-26 18:42 ` [PATCH 4/4] io_uring: switch normal task_work " Jens Axboe
2024-03-27 13:33 ` [PATCHSET 0/4] Use io_wq_work_list for task_work Pavel Begunkov
2024-03-27 16:36   ` Jens Axboe
2024-03-27 17:05     ` Jens Axboe
2024-03-27 18:04     ` Pavel Begunkov
