public inbox for [email protected]
 help / color / mirror / Atom feed
* [GIT PULL] io_uring thread worker change
@ 2021-02-26 22:04 Jens Axboe
  2021-02-27  2:48 ` Linus Torvalds
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Jens Axboe @ 2021-02-26 22:04 UTC (permalink / raw)
  To: torvalds; +Cc: io-uring

Hi Linus,

As mentioned last week, I'm sending the series on converting the io-wq
workers to be forked off the tasks in question instead of being kernel
threads that assume various bits of the original task identity. This is
on top of the for-5.12/io_uring branch I sent out yesterday, which I
think you've merged but just haven't pushed out yet.

This kills > 400 lines of code from io_uring/io-wq, and it's the worst
part of the code. We've had several bugs in this area, and the worry is
always that we could be missing some pieces for file types doing unusual
things (recent /dev/tty example comes to mind, userfaultfd reads
installing file descriptors is another fun one... - both of which need
special handling, and I bet it's not the last weird oddity we'll find).
With these identical workers, we can have full confidence that we're
never missing anything. That, in itself, is a huge win. Outside of that,
it's also more efficient since we're not wasting space and code on
tracking state, or switching between different states.

I'm sure we're going to find little things to patch up after this
series, but testing has been pretty thorough, from the usual regression
suite to production. Any issue that may crop up should be manageable.
There's also a nice series of further reductions we can do on top of
this, but I wanted to get the meat of it out sooner rather than later.
The general worry here isn't that it's fundamentally broken. Most of the
little issues we've found over the last week have been related to just
changes in how thread startup/exit is done, since that's the main
difference between using kthreads and these kinds of threads. In fact,
if all goes according to plan, I want to get this into the 5.10 and 5.11
stable branches as well.

That said, the changes outside of io_uring/io-wq are:

- arch setup, simple one-liner to each arch copy_thread()
  implementation.

- Removal of net and proc restrictions for io_uring, they are no longer
  needed or useful.

I'm leaving this to you if you're OK pulling it for this release, I'm
obviously in favor of doing so since I'm sending it to you. It'll help
me sleep better at night, and provide a *much* saner base for async
offload going forward. A bit of potential pain in the rc cycle is
totally worth it imho.


The following changes since commit b6c23dd5a483174f386e4c2e1711d9532e090c00:

  io_uring: run task_work on io_uring_register() (2021-02-21 17:18:56 -0700)

are available in the Git repository at:

  git://git.kernel.dk/linux-block.git tags/io_uring-worker.v3-2021-02-25

for you to fetch changes up to d6ce7f6761bf6d669d9c74ec5d3bd1bfe92380c5:

  io-wq: remove now unused IO_WQ_BIT_ERROR (2021-02-25 10:19:43 -0700)

----------------------------------------------------------------
io_uring-worker.v3-2021-02-25

----------------------------------------------------------------
Jens Axboe (31):
      Merge branch 'for-5.12/io_uring' into io_uring-worker.v3
      io_uring: remove the need for relying on an io-wq fallback worker
      io-wq: don't create any IO workers upfront
      io_uring: disable io-wq attaching
      io-wq: get rid of wq->use_refs
      io_uring: tie async worker side to the task context
      io-wq: don't pass 'wqe' needlessly around
      arch: setup PF_IO_WORKER threads like PF_KTHREAD
      kernel: treat PF_IO_WORKER like PF_KTHREAD for ptrace/signals
      io-wq: fork worker threads from original task
      io-wq: worker idling always returns false
      io_uring: remove any grabbing of context
      io_uring: remove io_identity
      io-wq: only remove worker from free_list, if it was there
      io-wq: make io_wq_fork_thread() available to other users
      io_uring: move SQPOLL thread io-wq forked worker
      Revert "proc: don't allow async path resolution of /proc/thread-self components"
      Revert "proc: don't allow async path resolution of /proc/self components"
      net: remove cmsg restriction from io_uring based send/recvmsg calls
      io_uring: flag new native workers with IORING_FEAT_NATIVE_WORKERS
      io-wq: remove nr_process accounting
      io_uring: cleanup ->user usage
      arch: ensure parisc/powerpc handle PF_IO_WORKER in copy_thread()
      io_uring: ensure io-wq context is always destroyed for tasks
      io-wq: fix races around manager/worker creation and task exit
      io-wq: fix race around io_worker grabbing
      io-wq: make buffered file write hashed work map per-ctx
      io_uring: ensure SQPOLL startup is triggered before error shutdown
      io-wq: improve manager/worker handling over exec
      io_uring: fix SQPOLL thread handling over exec
      io-wq: remove now unused IO_WQ_BIT_ERROR

 arch/alpha/kernel/process.c      |    2 +-
 arch/arc/kernel/process.c        |    2 +-
 arch/arm/kernel/process.c        |    2 +-
 arch/arm64/kernel/process.c      |    2 +-
 arch/csky/kernel/process.c       |    2 +-
 arch/h8300/kernel/process.c      |    2 +-
 arch/hexagon/kernel/process.c    |    2 +-
 arch/ia64/kernel/process.c       |    2 +-
 arch/m68k/kernel/process.c       |    2 +-
 arch/microblaze/kernel/process.c |    2 +-
 arch/mips/kernel/process.c       |    2 +-
 arch/nds32/kernel/process.c      |    2 +-
 arch/nios2/kernel/process.c      |    2 +-
 arch/openrisc/kernel/process.c   |    2 +-
 arch/parisc/kernel/process.c     |    2 +-
 arch/powerpc/kernel/process.c    |    2 +-
 arch/riscv/kernel/process.c      |    2 +-
 arch/s390/kernel/process.c       |    2 +-
 arch/sh/kernel/process_32.c      |    2 +-
 arch/sparc/kernel/process_32.c   |    2 +-
 arch/sparc/kernel/process_64.c   |    2 +-
 arch/um/kernel/process.c         |    2 +-
 arch/x86/kernel/process.c        |    2 +-
 arch/xtensa/kernel/process.c     |    2 +-
 fs/io-wq.c                       |  621 ++++++++---------
 fs/io-wq.h                       |   35 +-
 fs/io_uring.c                    | 1088 +++++++++++-------------------
 fs/proc/self.c                   |    7 -
 fs/proc/thread_self.c            |    7 -
 include/linux/io_uring.h         |   22 +-
 include/linux/net.h              |    3 -
 include/linux/sched.h            |    3 +
 include/uapi/linux/io_uring.h    |    1 +
 kernel/ptrace.c                  |    2 +-
 kernel/signal.c                  |    4 +-
 net/ipv4/af_inet.c               |    1 -
 net/ipv6/af_inet6.c              |    1 -
 net/socket.c                     |   10 -
 38 files changed, 716 insertions(+), 1137 deletions(-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] io_uring thread worker change
  2021-02-26 22:04 [GIT PULL] io_uring thread worker change Jens Axboe
@ 2021-02-27  2:48 ` Linus Torvalds
  2021-02-27  4:27   ` Jens Axboe
  2021-02-27 16:38 ` Linus Torvalds
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2021-02-27  2:48 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

Reading through this once..

I'll read through it another time tomorrow before merging it, but on
the whole it looks good. The concept certainly is the right now, but I
have a few questions about some of the details..

On Fri, Feb 26, 2021 at 2:04 PM Jens Axboe <[email protected]> wrote:
>
>       io-wq: fix races around manager/worker creation and task exit

Where did that

+        __set_current_state(TASK_RUNNING);

in the patch come from? That looks odd.

Is it because io_wq_manager() does a

        set_current_state(TASK_INTERRUPTIBLE);

at one point? Regardless, that code looks very very strange.

>       io_uring: remove the need for relying on an io-wq fallback worker

This is coded really oddly.

+ do {
+     head = NULL;
+     work = READ_ONCE(ctx->exit_task_work);
+ } while (cmpxchg(&ctx->exit_task_work, work, head) != work);

Whaa?

If you want to write

    work = xchg(&ctx->exit_task_work, NULL);

then just do that. Instead of an insane cmpxchg loop that isn't even
well-written.

I say that it isn't even well-written, because when you really do want
a cmpxchg loop, then realize that cmpxchg() returns the old value, so
the proper way to write it is

    new = whatever;
    expect = READ_ONCE(ctx->exit_task_work);
    for (;;) {
          new->next = expect;  // Mumble mumble - this is why you want
the cmpxchg
          was = cmpxchg(&ctx->exit_task_work, expect, new);
          if (was == expect)
                  break;
          expect = was;
    }

IOW, that READ_ONCE() of the old value should happen only once - and
isn't worth a READ_ONCE() in the first place. There's nothing "read
once" about it.

But as mentioned, if all you want is an atomic "get and replace with
NULL", then just a plain "xchg()" is what you should do.

Yes, that cmpxchg loop _works_, but it is really confusing to read
that way, when clearly you don't actually need or really want a
cmpxchg.

(The other cmpxchg loop in the same patch is needed, but does that
unnecessary "re-read the value that cmpxchg just returned").

Please explain.

              Linus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] io_uring thread worker change
  2021-02-27  2:48 ` Linus Torvalds
@ 2021-02-27  4:27   ` Jens Axboe
  0 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2021-02-27  4:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: io-uring

On 2/26/21 7:48 PM, Linus Torvalds wrote:
> Reading through this once..
> 
> I'll read through it another time tomorrow before merging it, but on
> the whole it looks good. The concept certainly is the right now, but I
> have a few questions about some of the details..

Thanks, I do think it's the right path. As mentioned there's a few minor
rough edges and cases that we've fixed since, and those are soaking as
we speak to verify that they are good. But nothing really major.

> On Fri, Feb 26, 2021 at 2:04 PM Jens Axboe <[email protected]> wrote:
>>
>>       io-wq: fix races around manager/worker creation and task exit
> 
> Where did that
> 
> +        __set_current_state(TASK_RUNNING);
> 
> in the patch come from? That looks odd.
> 
> Is it because io_wq_manager() does a
> 
>         set_current_state(TASK_INTERRUPTIBLE);
> 
> at one point? Regardless, that code looks very very strange.

Right, it's for both out of necessity for worker creation (since it
blocks), but also so that we'll run the loop again when we return.
Basically:

- Set non-runnable state
- Check condition
- schedule

and the schedule will be a no-op, so we'll do that loop again and
finally then schedule if nothing woke us up. If we don't end up needing
to fork a worker, then the stateremains TASK_INTERRUPTIBLE and we go to
sleep as originally planned.

>>       io_uring: remove the need for relying on an io-wq fallback worker
> 
> This is coded really oddly.
> 
> + do {
> +     head = NULL;
> +     work = READ_ONCE(ctx->exit_task_work);
> + } while (cmpxchg(&ctx->exit_task_work, work, head) != work);
> 
> Whaa?
> 
> If you want to write
> 
>     work = xchg(&ctx->exit_task_work, NULL);
> 
> then just do that. Instead of an insane cmpxchg loop that isn't even
> well-written.
> 
> I say that it isn't even well-written, because when you really do want
> a cmpxchg loop, then realize that cmpxchg() returns the old value, so
> the proper way to write it is
> 
>     new = whatever;
>     expect = READ_ONCE(ctx->exit_task_work);
>     for (;;) {
>           new->next = expect;  // Mumble mumble - this is why you want
> the cmpxchg
>           was = cmpxchg(&ctx->exit_task_work, expect, new);
>           if (was == expect)
>                   break;
>           expect = was;
>     }
> 
> IOW, that READ_ONCE() of the old value should happen only once - and
> isn't worth a READ_ONCE() in the first place. There's nothing "read
> once" about it.
> 
> But as mentioned, if all you want is an atomic "get and replace with
> NULL", then just a plain "xchg()" is what you should do.
> 
> Yes, that cmpxchg loop _works_, but it is really confusing to read
> that way, when clearly you don't actually need or really want a
> cmpxchg.
> 
> (The other cmpxchg loop in the same patch is needed, but does that
> unnecessary "re-read the value that cmpxchg just returned").
> 
> Please explain.

You are totally right, and Eric and Pavel actually commented on that
too. I decided to just leave it for a cleanup patch, because it should
just be an xchg(). I didn't deem it important enough to go back and fix
the original patch, since it isn't broken, it's just not as well written
or as optimal as it could be. It'll get changed to xchg().

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] io_uring thread worker change
  2021-02-26 22:04 [GIT PULL] io_uring thread worker change Jens Axboe
  2021-02-27  2:48 ` Linus Torvalds
@ 2021-02-27 16:38 ` Linus Torvalds
  2021-02-27 19:47   ` Jens Axboe
  2021-02-27 16:41 ` pr-tracker-bot
  2021-03-11 10:43 ` Backporting to stable... " Stefan Metzmacher
  3 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2021-02-27 16:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Fri, Feb 26, 2021 at 2:04 PM Jens Axboe <[email protected]> wrote:
>
>   git://git.kernel.dk/linux-block.git tags/io_uring-worker.v3-2021-02-25

Ok, I've pulled it, because it's clearly the right thing to do. It
would have been nicer had it been ready earlier in the merge window,
but delaying it will only make things worse, I suspect.

             Linus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] io_uring thread worker change
  2021-02-26 22:04 [GIT PULL] io_uring thread worker change Jens Axboe
  2021-02-27  2:48 ` Linus Torvalds
  2021-02-27 16:38 ` Linus Torvalds
@ 2021-02-27 16:41 ` pr-tracker-bot
  2021-03-11 10:43 ` Backporting to stable... " Stefan Metzmacher
  3 siblings, 0 replies; 8+ messages in thread
From: pr-tracker-bot @ 2021-02-27 16:41 UTC (permalink / raw)
  To: Jens Axboe; +Cc: torvalds, io-uring

The pull request you sent on Fri, 26 Feb 2021 15:04:20 -0700:

> git://git.kernel.dk/linux-block.git tags/io_uring-worker.v3-2021-02-25

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/5695e51619745d4fe3ec2506a2f0cd982c5e27a4

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] io_uring thread worker change
  2021-02-27 16:38 ` Linus Torvalds
@ 2021-02-27 19:47   ` Jens Axboe
  0 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2021-02-27 19:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: io-uring

On 2/27/21 9:38 AM, Linus Torvalds wrote:
> On Fri, Feb 26, 2021 at 2:04 PM Jens Axboe <[email protected]> wrote:
>>
>>   git://git.kernel.dk/linux-block.git tags/io_uring-worker.v3-2021-02-25
> 
> Ok, I've pulled it, because it's clearly the right thing to do. It
> would have been nicer had it been ready earlier in the merge window,
> but delaying it will only make things worse, I suspect.

Oh I agree, in fact I would've loved to have done this releases
earlier... But better late than never. Thanks!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Backporting to stable... Re: [GIT PULL] io_uring thread worker change
  2021-02-26 22:04 [GIT PULL] io_uring thread worker change Jens Axboe
                   ` (2 preceding siblings ...)
  2021-02-27 16:41 ` pr-tracker-bot
@ 2021-03-11 10:43 ` Stefan Metzmacher
  2021-03-11 18:46   ` Jens Axboe
  3 siblings, 1 reply; 8+ messages in thread
From: Stefan Metzmacher @ 2021-03-11 10:43 UTC (permalink / raw)
  To: Jens Axboe, torvalds; +Cc: io-uring

Hi Jens,

> I'm sure we're going to find little things to patch up after this
> series, but testing has been pretty thorough, from the usual regression
> suite to production. Any issue that may crop up should be manageable.
> There's also a nice series of further reductions we can do on top of
> this, but I wanted to get the meat of it out sooner rather than later.
> The general worry here isn't that it's fundamentally broken. Most of the
> little issues we've found over the last week have been related to just
> changes in how thread startup/exit is done, since that's the main
> difference between using kthreads and these kinds of threads. In fact,
> if all goes according to plan, I want to get this into the 5.10 and 5.11
> stable branches as well.

That would mean that IORING_FEAT_SQPOLL_NONFIXED would be implicitly be backported
from 5.11 to 5.10, correct?

I'm wondering if I can advice people to move to 5.10 (as it's an lts release)
in order to get a kernel that is most likely very useful to use in combination
with Samba's drafted usage of io_uring, where I'd like to use IORING_FEAT_SQPOLL_NONFIXED
and IORING_FEAT_NATIVE_WORKERS in order to use SENDMSG/RECVMSG with msg_control buffers (where
the control buffers may reference file descriptors).

Thanks!
metze


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Backporting to stable... Re: [GIT PULL] io_uring thread worker change
  2021-03-11 10:43 ` Backporting to stable... " Stefan Metzmacher
@ 2021-03-11 18:46   ` Jens Axboe
  0 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2021-03-11 18:46 UTC (permalink / raw)
  To: Stefan Metzmacher, torvalds; +Cc: io-uring

On 3/11/21 3:43 AM, Stefan Metzmacher wrote:
> Hi Jens,
> 
>> I'm sure we're going to find little things to patch up after this
>> series, but testing has been pretty thorough, from the usual regression
>> suite to production. Any issue that may crop up should be manageable.
>> There's also a nice series of further reductions we can do on top of
>> this, but I wanted to get the meat of it out sooner rather than later.
>> The general worry here isn't that it's fundamentally broken. Most of the
>> little issues we've found over the last week have been related to just
>> changes in how thread startup/exit is done, since that's the main
>> difference between using kthreads and these kinds of threads. In fact,
>> if all goes according to plan, I want to get this into the 5.10 and 5.11
>> stable branches as well.
> 
> That would mean that IORING_FEAT_SQPOLL_NONFIXED would be implicitly
> be backported from 5.11 to 5.10, correct?

Right, that would happen by default if we moved the new worker code back
to 5.10 and 5.11.

> I'm wondering if I can advice people to move to 5.10 (as it's an lts
> release) in order to get a kernel that is most likely very useful to
> use in combination with Samba's drafted usage of io_uring, where I'd
> like to use IORING_FEAT_SQPOLL_NONFIXED and IORING_FEAT_NATIVE_WORKERS
> in order to use SENDMSG/RECVMSG with msg_control buffers (where the
> control buffers may reference file descriptors).

Understandable! It's worth nothing that 5.10 would also need a backport
of the TIF_NOTIFY_SIGNAL change - could be done without, but then we'd
have diverging code bases to some degree, which would be something I'd
love to avoid.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-03-11 18:47 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-02-26 22:04 [GIT PULL] io_uring thread worker change Jens Axboe
2021-02-27  2:48 ` Linus Torvalds
2021-02-27  4:27   ` Jens Axboe
2021-02-27 16:38 ` Linus Torvalds
2021-02-27 19:47   ` Jens Axboe
2021-02-27 16:41 ` pr-tracker-bot
2021-03-11 10:43 ` Backporting to stable... " Stefan Metzmacher
2021-03-11 18:46   ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox