From: Dylan Yudaken <[email protected]>
To: "[email protected]" <[email protected]>,
"[email protected]" <[email protected]>,
"[email protected]" <[email protected]>
Cc: Kernel Team <[email protected]>
Subject: Re: [PATCH for-next v3 4/7] io_uring: add IORING_SETUP_DEFER_TASKRUN
Date: Tue, 30 Aug 2022 09:54:56 +0000
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
On Mon, 2022-08-22 at 12:34 +0100, Pavel Begunkov wrote:
> On 8/19/22 13:19, Dylan Yudaken wrote:
> > Allow deferring async tasks until the user calls io_uring_enter(2)
> > with the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag
> > at io_uring_setup time. This functionality requires that the later
> > io_uring_enter will be called from the same submission task, and
> > therefore restrict this flag to work only when
> > IORING_SETUP_SINGLE_ISSUER is also set.
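
(For anyone following along: from userspace the pairing looks roughly
like the sketch below. This is untested illustration, not part of the
patch - it assumes a liburing new enough to define the flag, and the
queue depth is arbitrary.)

    /* Hedged sketch: opt in at setup time, then reap completions from
     * the same task; deferred task work runs inside the enter call.
     */
    #include <liburing.h>

    static int setup_deferred_ring(struct io_uring *ring)
    {
            struct io_uring_params p = { };

            /* DEFER_TASKRUN is only accepted together with SINGLE_ISSUER */
            p.flags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN;
            return io_uring_queue_init_params(256, ring, &p);
    }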
>
> Looks ok, a couple of small comments below, but I don't see anything
> blocking it.
>
> > Being able to hand pick when tasks are run prevents the problem
> > where there is current work to be done, but task work runs anyway.
> >
> > For example, a common workload would obtain a batch of CQEs, and
> > process each one. Interrupting this to run additional task work
> > would add latency but not gain anything. If instead task work is
> > deferred to just before more CQEs are obtained then no additional
> > latency is added.
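
(Illustrative sketch of that reap loop in liburing terms - handle_cqe()
is a hypothetical handler and the batch size is arbitrary:)

    /* Deferred task work runs inside the enter call, immediately before
     * we go looking for completions, so nothing interrupts the batch.
     */
    static void reap_batch(struct io_uring *ring)
    {
            struct io_uring_cqe *cqes[32];
            unsigned n, i;

            io_uring_submit_and_wait(ring, 1);
            n = io_uring_peek_batch_cqe(ring, cqes, 32);
            for (i = 0; i < n; i++)
                    handle_cqe(cqes[i]);    /* hypothetical handler */
            io_uring_cq_advance(ring, n);
    }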
> >
> > The way this is implemented is by trying to keep task work local
> > to an io_ring_ctx, rather than to the submission task. This is
> > required, as the application will want to wake up only a single
> > io_ring_ctx at a time to process work, and so the lists of work
> > have to be kept separate.
> >
> > This has some other benefits like not having to check the task
> > continually in handle_tw_list (and potentially unlocking/locking
> > those), and reducing locks in the submit & process completions
> > path.
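
(To make the per-ctx list concrete, the queueing side ends up roughly
like the condensed sketch below - a simplification of the patch, with
idle/eventfd details elided.)

    /* Condensed sketch of the add side: requests go on a per-ctx
     * lockless list instead of the per-task list, and the single
     * submitter task is woken at most once per batch.
     */
    static void io_req_local_work_add(struct io_kiocb *req)
    {
            struct io_ring_ctx *ctx = req->ctx;

            /* llist_add() returns true only if the list was empty, so
             * the wakeup below fires once per batch of queued work */
            if (!llist_add(&req->io_task_work.node, &ctx->work_llist))
                    return;

            if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
                    atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);

            io_cqring_wake(ctx);
    }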
> >
> > There are networking cases where using this option can reduce
> > request latency by 50%. For example, a contrived benchmark using
> > [1] where the client sends 2k of data and receives the same data
> > back while doing some system calls (to trigger task work) shows
> > this reduction. The reason ends up being that if sending responses
> > is delayed by processing task work, then the client side sits idle.
> > Whereas reordering the sends first means that the client runs its
> > workload in parallel with the local task work.
>
> Quite contrived; for some it may cut latency in half, but for others
> it could as easily increase it twofold. In any case, that's not a
> critique of the feature, as it's optional, but it raises the question
> of whether we need to add some fairness / scheduling here.
>
> > [1]:
> > Using https://github.com/DylanZA/netbench/tree/defer_run
> > Client:
> >   ./netbench --client_only 1 --control_port 10000 --host <host> \
> >     --tx "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 1000"
> > Server:
> >   ./netbench --server_only 1 --control_port 10000 \
> >     --rx "io_uring --defer_taskrun 0 --workload 100" \
> >     --rx "io_uring --defer_taskrun 1 --workload 100"
> >
> > Signed-off-by: Dylan Yudaken <[email protected]>
> > ---
>
> > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > index 53696dd90626..6572d2276750 100644
> > --- a/io_uring/io_uring.c
> > +++ b/io_uring/io_uring.c
> [...]
>
> > +int io_run_local_work(struct io_ring_ctx *ctx, bool locked)
> > +{
> > +	struct llist_node *node;
> > +	struct llist_node fake;
> > +	struct llist_node *current_final = NULL;
> > +	int ret;
> > +
> > +	if (unlikely(ctx->submitter_task != current)) {
> > +		if (locked)
> > +			mutex_unlock(&ctx->uring_lock);
> > +
> > +		/* maybe this is before any submissions */
> > +		if (!ctx->submitter_task)
> > +			return 0;
> > +
> > +		return -EEXIST;
> > +	}
> > +
> > +	if (!locked)
> > +		locked = mutex_trylock(&ctx->uring_lock);
> > +
> > +	node = io_llist_xchg(&ctx->work_llist, &fake);
> > +	ret = 0;
> > +again:
> > +	while (node != current_final) {
> > +		struct llist_node *next = node->next;
> > +		struct io_kiocb *req = container_of(node, struct io_kiocb,
> > +						    io_task_work.node);
> > +		prefetch(container_of(next, struct io_kiocb, io_task_work.node));
> > +		req->io_task_work.func(req, &locked);
> > +		ret++;
> > +		node = next;
> > +	}
> > +
> > +	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
> > +		atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
> > +
> > +	node = io_llist_cmpxchg(&ctx->work_llist, &fake, NULL);
> > +	if (node != &fake) {
> > +		current_final = &fake;
> > +		node = io_llist_xchg(&ctx->work_llist, &fake);
> > +		goto again;
> > +	}
> > +
> > +	if (locked) {
> > +		io_submit_flush_completions(ctx);
> > +		mutex_unlock(&ctx->uring_lock);
> > +	}
> > +	return ret;
> > +}
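
(A note on the &fake sentinel for anyone reading along: the first xchg
grabs the whole pending list while leaving &fake behind as a marker, so
work queued while the callbacks run lands in front of it. The closing
cmpxchg retires the marker only if nothing new arrived; otherwise the
list is swapped out again and the walk repeats, with current_final =
&fake marking where the previous pass ended.)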
>
> I was thinking about:
>
> int io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
> {
> 	*locked = try_lock();
> }
>
> bool locked = false;
> io_run_local_work(ctx, &locked);
> if (locked)
> 	unlock();
>
> // or just as below when already holding it
> bool locked = true;
> io_run_local_work(ctx, &locked);
>
> Which would replace
>
> if (DEFER) {
> 	// we're assuming that it'll unlock
> 	io_run_local_work(true);
> } else {
> 	unlock();
> }
>
> with
>
> if (DEFER) {
> 	bool locked = true;
> 	io_run_local_work(&locked);
> }
> unlock();
>
> But anyway, it can be mulled later.
I think there is an easier way to clean it up if we allow an extra
unlock/lock in io_uring_enter (see below). Will do that in v4.
>
>
> > -int io_run_task_work_sig(void)
> > +int io_run_task_work_sig(struct io_ring_ctx *ctx)
> >  {
> > -	if (io_run_task_work())
> > +	if (io_run_task_work_ctx(ctx))
> >  		return 1;
> >  	if (task_sigpending(current))
> >  		return -EINTR;
> > @@ -2196,7 +2294,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
> >  	unsigned long check_cq;
> >
> >  	/* make sure we run task_work before checking for signals */
> > -	ret = io_run_task_work_sig();
> > +	ret = io_run_task_work_sig(ctx);
> >  	if (ret || io_should_wake(iowq))
> >  		return ret;
> >
> > @@ -2230,7 +2328,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> >  		io_cqring_overflow_flush(ctx);
> >  		if (io_cqring_events(ctx) >= min_events)
> >  			return 0;
> > -		if (!io_run_task_work())
> > +		if (!io_run_task_work_ctx(ctx))
> >  			break;
> >  	} while (1);
> >
> >
> > @@ -2573,6 +2671,9 @@ static __cold void io_ring_exit_work(struct work_struct *work)
> >  	 * as nobody else will be looking for them.
> >  	 */
> >  	do {
> > +		if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
> > +			io_move_task_work_from_local(ctx);
> > +
> >  		while (io_uring_try_cancel_requests(ctx, NULL, true))
> >  			cond_resched();
> >
> > @@ -2768,6 +2869,8 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
> >  		}
> >  	}
> >
> > +	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
> > +		ret |= io_run_local_work(ctx, false) > 0;
> >  	ret |= io_cancel_defer_files(ctx, task, cancel_all);
> >  	mutex_lock(&ctx->uring_lock);
> >  	ret |= io_poll_remove_all(ctx, task, cancel_all);
> > @@ -3057,10 +3160,20 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
> >  		}
> >  		if ((flags & IORING_ENTER_GETEVENTS) && ctx->syscall_iopoll)
> >  			goto iopoll_locked;
> > +		if ((flags & IORING_ENTER_GETEVENTS) &&
> > +		    (ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
> > +			int ret2 = io_run_local_work(ctx, true);
> > +
> > +			if (unlikely(ret2 < 0))
> > +				goto out;
>
> It's an optimisation and we don't have to handle errors here;
> let's ignore them and make it look a bit better.
I'm not convinced about that - then there is no way the application
will know it is trying to complete events on the wrong thread. Work
will just silently pile up instead.

That being said, with the changes below I think I can just get rid of
this code.
>
> > +			goto getevents_ran_local;
> > +		}
> >  		mutex_unlock(&ctx->uring_lock);
> >  	}
> > +
> >  	if (flags & IORING_ENTER_GETEVENTS) {
> >  		int ret2;
> > +
> >  		if (ctx->syscall_iopoll) {
> >  			/*
> >  			 * We disallow the app entering submit/complete with
> > @@ -3081,6 +3194,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
> >  			const sigset_t __user *sig;
> >  			struct __kernel_timespec __user *ts;
> >
> > +			if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
>
> I think it should be in io_cqring_wait(), which calls it anyway
> at the beginning. Instead of
>
> do {
> 	io_cqring_overflow_flush(ctx);
> 	if (io_cqring_events(ctx) >= min_events)
> 		return 0;
> 	if (!io_run_task_work())
> 		break;
> } while (1);
>
> Let's have
>
> do {
> 	ret = io_run_task_work_ctx(ctx);
> 	// handle ret
> 	io_cqring_overflow_flush(ctx);
> 	if (io_cqring_events(ctx) >= min_events)
> 		return 0;
> } while (1);
I think that is ok. The downside is that it adds an extra lock/unlock
of the ctx in some cases. I assume that will be negligible?
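
Roughly, I'd expect the reworked entry loop to look something like
this (untested sketch; the error handling is my guess at what
"handle ret" expands to, not final code):

    do {
            /* runs local work for DEFER_TASKRUN rings, normal task
             * work otherwise; < 0 means we are on the wrong task */
            ret = io_run_task_work_ctx(ctx);
            if (ret < 0)
                    return ret;
            io_cqring_overflow_flush(ctx);
            if (io_cqring_events(ctx) >= min_events)
                    return 0;
            if (!ret)
                    break;
    } while (1);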
>
> > +				ret2 = io_run_local_work(ctx, false);
> > +				if (unlikely(ret2 < 0))
> > +					goto getevents_out;
> > +			}
> > +getevents_ran_local:
> >  			ret2 = io_get_ext_arg(flags, argp, &argsz, &ts, &sig);
> >  			if (likely(!ret2)) {
> >  				min_complete = min(min_complete,
> > @@ -3090,6 +3209,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
> >  		}
> >  	}
> >
> > +getevents_out:
> >  	if (!ret) {
> >  		ret = ret2;
> >
> >
> > @@ -3289,17 +3409,29 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
> >  	if (ctx->flags & IORING_SETUP_SQPOLL) {
> >  		/* IPI related flags don't make sense with SQPOLL */
> >  		if (ctx->flags & (IORING_SETUP_COOP_TASKRUN |
> > -				  IORING_SETUP_TASKRUN_FLAG))
> > +				  IORING_SETUP_TASKRUN_FLAG |
> > +				  IORING_SETUP_DEFER_TASKRUN))
>
> Sounds like we should also fail if SQPOLL is set, especially with
> the task check on the waiting side.
>
That is what this code is doing, I think - did I miss something?
> >  			goto err;
> >  		ctx->notify_method = TWA_SIGNAL_NO_IPI;
> >  	} else if (ctx->flags & IORING_SETUP_COOP_TASKRUN) {
> >  		ctx->notify_method = TWA_SIGNAL_NO_IPI;
> [...]
> >  	mutex_lock(&ctx->uring_lock);
> >  	ret = __io_uring_register(ctx, opcode, arg, nr_args);
> > diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> > index 2f73f83af960..a9fb115234af 100644
> > --- a/io_uring/io_uring.h
> > +++ b/io_uring/io_uring.h
> > @@ -26,7 +26,8 @@ enum {
> [...]
> > +static inline int io_run_task_work_unlock_ctx(struct io_ring_ctx *ctx)
> > +{
> > +	int ret;
> > +
> > +	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
> > +		ret = io_run_local_work(ctx, true);
> > +	} else {
> > +		mutex_unlock(&ctx->uring_lock);
> > +		ret = (int)io_run_task_work();
>
> Why do we need a cast? let's keep the return type same
OK, I'll update the return types here.
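
Probably something along these lines (untested sketch of what v4 might
look like - io_run_local_work()'s count collapsed to a bool to match
io_run_task_work()):

    static inline bool io_run_task_work_unlock_ctx(struct io_ring_ctx *ctx)
    {
            bool ret;

            if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
                    /* > 0 means some local work was run */
                    ret = io_run_local_work(ctx, true) > 0;
            } else {
                    mutex_unlock(&ctx->uring_lock);
                    ret = io_run_task_work();
            }
            return ret;
    }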
Dylan