[GIT PULL] io_uring fixes for 5.12-rc2

public inbox for [email protected]
 help / color / mirror / Atom feed

* [GIT PULL] io_uring fixes for 5.12-rc2
@ 2021-03-05 18:09 Jens Axboe
  2021-03-05 20:54 ` Linus Torvalds
  2021-03-05 21:58 ` pr-tracker-bot
  0 siblings, 2 replies; 8+ messages in thread
From: Jens Axboe @ 2021-03-05 18:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: io-uring

Hi Linus,

A bit of a mix between fallout from the worker change,
cleanups/reductions now possible from that change, and fixes in general.
In detail:

- Fully serialize manager and worker creation, fixing races due to that.

- Clean up some naming that had gone stale.

- SQPOLL fixes.

- Fix race condition around task_work rework that went into this merge
  window.

- Implement unshare. Used for when the original task does unshare(2) or
  setuid/seteuid and friends, drops the original workers and forks new
  ones.

- Drop the only remaining piece of state shuffling we had left, which
  was cred. Move it into issue instead, and we can drop all of that code
  too.

- Kill f_op->flush() usage. That was such a nasty hack that we had out
  of necessity, we no longer need it.

- Following from ->flush() removal, we can also drop various bits of ctx
  state related to SQPOLL and cancelations.

- Fix an issue with IOPOLL retry, which originally was fallout from a
  filemap change (removing iov_iter_revert()), but uncovered an issue
  with iovec re-import too late.

- Fix an issue with system suspend.

- Use xchg() for fallback work, instead of cmpxchg().

- Properly destroy io-wq on exec.

- Add create_io_thread() core helper, and use that in io-wq and
  io_uring. This allows us to remove various silly completion events
  related to thread setup.

- A few error handling fixes.

This should be the grunt of fixes necessary for the new workers, next
week should be quieter. We've got a pending series from Pavel on
cancelations, and how tasks and rings are indexed. Outside of that,
should just be minor fixes. Even with these fixes, we're still killing a
net ~80 lines.

Please pull!


The following changes since commit fe07bfda2fb9cdef8a4d4008a409bb02f35f1bd8:

  Linux 5.12-rc1 (2021-02-28 16:05:19 -0800)

are available in the Git repository at:

  git://git.kernel.dk/linux-block.git tags/io_uring-5.12-2021-03-05

for you to fetch changes up to e45cff58858883290c98f65d409839a7295c95f3:

  io_uring: don't restrict issue_flags for io_openat (2021-03-05 09:52:29 -0700)

----------------------------------------------------------------
io_uring-5.12-2021-03-05

----------------------------------------------------------------
Jens Axboe (27):
      io-wq: wait for worker startup when forking a new one
      io-wq: have manager wait for all workers to exit
      io-wq: don't ask for a new worker if we're exiting
      io-wq: rename wq->done completion to wq->started
      io-wq: wait for manager exit on wq destroy
      io-wq: fix double put of 'wq' in error path
      io_uring: SQPOLL stop error handling fixes
      io_uring: don't use complete_all() on SQPOLL thread exit
      io-wq: provide an io_wq_put_and_exit() helper
      io_uring: fix race condition in task_work add and clear
      io_uring: remove unused argument 'tsk' from io_req_caches_free()
      io_uring: kill unnecessary REQ_F_WORK_INITIALIZED checks
      io_uring: move cred assignment into io_issue_sqe()
      io_uring: kill unnecessary io_run_ctx_fallback() in io_ring_exit_work()
      io_uring: kill io_uring_flush()
      io_uring: ensure that SQPOLL thread is started for exit
      io_uring: ignore double poll add on the same waitqueue head
      io-wq: fix error path leak of buffered write hash map
      io_uring: fix -EAGAIN retry with IOPOLL
      io_uring: ensure that threads freeze on suspend
      io-wq: ensure all pending work is canceled on exit
      kernel: provide create_io_thread() helper
      io_uring: move to using create_io_thread()
      io_uring: don't keep looping for more events if we can't flush overflow
      io_uring: clear IOCB_WAITQ for non -EIOCBQUEUED return
      io-wq: kill hashed waitqueue before manager exits
      io_uring: make SQPOLL thread parking saner

Pavel Begunkov (14):
      io_uring: run fallback on cancellation
      io_uring: warn on not destroyed io-wq
      io_uring: destroy io-wq on exec
      io_uring: fix __tctx_task_work() ctx race
      io_uring: replace cmpxchg in fallback with xchg
      io_uring: kill sqo_dead and sqo submission halting
      io_uring: remove sqo_task
      io_uring: choose right tctx->io_wq for try cancel
      io_uring: inline io_req_clean_work()
      io_uring: inline __io_queue_async_work()
      io_uring: remove extra in_idle wake up
      io_uring: cancel-match based on flags
      io_uring: reliably cancel linked timeouts
      io_uring: don't restrict issue_flags for io_openat

 fs/io-wq.c                 | 261 +++++++++++------------
 fs/io-wq.h                 |   5 +-
 fs/io_uring.c              | 500 +++++++++++++++++++--------------------------
 include/linux/io_uring.h   |   2 +-
 include/linux/sched/task.h |   2 +
 kernel/fork.c              |  30 +++
 6 files changed, 361 insertions(+), 439 deletions(-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] io_uring fixes for 5.12-rc2
  2021-03-05 18:09 [GIT PULL] io_uring fixes for 5.12-rc2 Jens Axboe
@ 2021-03-05 20:54 ` Linus Torvalds
  2021-03-05 21:58   ` Jens Axboe
  2021-03-05 21:58 ` pr-tracker-bot
  1 sibling, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2021-03-05 20:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Fri, Mar 5, 2021 at 10:09 AM Jens Axboe <[email protected]> wrote:
>
> - Implement unshare. Used for when the original task does unshare(2) or
>   setuid/seteuid and friends, drops the original workers and forks new
>   ones.

This explanation makes no sense to me, and I don't see any commit that
would remotely do what I parse that as doing.

I left it in the merge message, but would like you to explain what you
actually meant...

                Linus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] io_uring fixes for 5.12-rc2
  2021-03-05 20:54 ` Linus Torvalds
@ 2021-03-05 21:58   ` Jens Axboe
  2021-03-05 23:03     ` Linus Torvalds
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2021-03-05 21:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: io-uring

On 3/5/21 1:54 PM, Linus Torvalds wrote:
> On Fri, Mar 5, 2021 at 10:09 AM Jens Axboe <[email protected]> wrote:
>>
>> - Implement unshare. Used for when the original task does unshare(2) or
>>   setuid/seteuid and friends, drops the original workers and forks new
>>   ones.
> 
> This explanation makes no sense to me, and I don't see any commit that
> would remotely do what I parse that as doing.

Hah, good catch... I actually wrote this up as I sent out the series for
reviews the other day, and I dropped the unshare bit as it was
contentious and I think there are better ways to do it. But I obviously
forgot to drop it from the commit message.

But since I have you here, would love to hear your thoughts. For the
setuid/seteuid, io_uring actually has a mechanism for this already, so I
don't believe we _need_ to do anything. You can register creds and use a
specific one for requests, this is what samba does for example.

But it pertains to the problem in general, which is how do we handle
when the original task sets up a ring and then goes and does
unshare(FILES) or whatever. It then submits a new request that just
happens to go async, which is oblivious to this fact. Same thing that
would happen in userspace if you created a thread pool and then did
unshare, and then had your existing threads handle work for you. Except
here it just kind of happens implicitly, and I can see how that would be
confusing or even problematic.

The patch mentioned above basically just did a cancel across that, like
we do for exec. If you go async after that, you get a new manager then
new threads, which takes care of it. But it's also very brutal and not
necessarily all that useful for the application.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] io_uring fixes for 5.12-rc2
  2021-03-05 18:09 [GIT PULL] io_uring fixes for 5.12-rc2 Jens Axboe
  2021-03-05 20:54 ` Linus Torvalds
@ 2021-03-05 21:58 ` pr-tracker-bot
  1 sibling, 0 replies; 8+ messages in thread
From: pr-tracker-bot @ 2021-03-05 21:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linus Torvalds, io-uring

The pull request you sent on Fri, 5 Mar 2021 11:09:09 -0700:

> git://git.kernel.dk/linux-block.git tags/io_uring-5.12-2021-03-05

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/f292e8730a349577aaf13635399b39a50b8f5910

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] io_uring fixes for 5.12-rc2
  2021-03-05 21:58   ` Jens Axboe
@ 2021-03-05 23:03     ` Linus Torvalds
  2021-03-06 16:25       ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2021-03-05 23:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Fri, Mar 5, 2021 at 1:58 PM Jens Axboe <[email protected]> wrote:
>
> But it pertains to the problem in general, which is how do we handle
> when the original task sets up a ring and then goes and does
> unshare(FILES) or whatever. It then submits a new request that just
> happens to go async, which is oblivious to this fact. Same thing that
> would happen in userspace if you created a thread pool and then did
> unshare, and then had your existing threads handle work for you. Except
> here it just kind of happens implicitly, and I can see how that would be
> confusing or even problematic.

Honestly, I'd aim for a "keep a cred per request".  And if that means
that then the async worker thread has to check that it matches the
cred it already has, then so be it.

Otherwise, you'll always have odd situations where "synchronous
completion gets different results than something that was kicked off
to an async thread".

But I don't think this has anything to do with unshare() per se. I
think this is a generic "hey, the process can change creds between
ring creation - and thread creation - and submission of an io_ring
command".

No? Or am I misunderstanding what you're thinking of?

        Linus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] io_uring fixes for 5.12-rc2
  2021-03-05 23:03     ` Linus Torvalds
@ 2021-03-06 16:25       ` Jens Axboe
  2021-03-06 21:14         ` Linus Torvalds
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2021-03-06 16:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: io-uring

On 3/5/21 4:03 PM, Linus Torvalds wrote:
> On Fri, Mar 5, 2021 at 1:58 PM Jens Axboe <[email protected]> wrote:
>>
>> But it pertains to the problem in general, which is how do we handle
>> when the original task sets up a ring and then goes and does
>> unshare(FILES) or whatever. It then submits a new request that just
>> happens to go async, which is oblivious to this fact. Same thing that
>> would happen in userspace if you created a thread pool and then did
>> unshare, and then had your existing threads handle work for you. Except
>> here it just kind of happens implicitly, and I can see how that would be
>> confusing or even problematic.
> 
> Honestly, I'd aim for a "keep a cred per request".  And if that means
> that then the async worker thread has to check that it matches the
> cred it already has, then so be it.

Agree, which is actually not that bad and can be done without having
io-wq deal with it. See below.

> Otherwise, you'll always have odd situations where "synchronous
> completion gets different results than something that was kicked off
> to an async thread".

Exactly, that was my main concern. It violates the principle of least
surprise, and I didn't like it.

> But I don't think this has anything to do with unshare() per se. I
> think this is a generic "hey, the process can change creds between
> ring creation - and thread creation - and submission of an io_ring
> command".
> 
> No? Or am I misunderstanding what you're thinking of?

You're not, but creds is just one part of it. But I think we're OK
saying "If you do unshare(CLONE_FILES) or CLONE_FS, then it's not
going to impact async offload for your io_uring". IOW, you really
should do that before setting up your ring(s).


commit 6169c4a517317ff729553b66d55957bf03912dc4
Author: Jens Axboe <[email protected]>
Date:   Sat Mar 6 09:22:27 2021 -0700

    io-wq: always track creds for async issue
    
    If we go async with a request, grab the creds that the task currently has
    assigned and make sure that the async side switches to them. This is
    handled in the same way that we do for registered personalities.
    
    Signed-off-by: Jens Axboe <[email protected]>

diff --git a/fs/io-wq.h b/fs/io-wq.h
index 5fbf7997149e..1ac2f3248088 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -79,8 +79,8 @@ static inline void wq_list_del(struct io_wq_work_list *list,
 
 struct io_wq_work {
 	struct io_wq_work_node list;
+	const struct cred *creds;
 	unsigned flags;
-	unsigned short personality;
 };
 
 static inline struct io_wq_work *wq_next_work(struct io_wq_work *work)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 92c25b5f1349..d51c6ba9268b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1183,6 +1183,9 @@ static void io_prep_async_work(struct io_kiocb *req)
 	const struct io_op_def *def = &io_op_defs[req->opcode];
 	struct io_ring_ctx *ctx = req->ctx;
 
+	if (!req->work.creds)
+		req->work.creds = get_current_cred();
+
 	if (req->flags & REQ_F_FORCE_ASYNC)
 		req->work.flags |= IO_WQ_WORK_CONCURRENT;
 
@@ -1648,6 +1651,10 @@ static void io_dismantle_req(struct io_kiocb *req)
 		io_put_file(req, req->file, (req->flags & REQ_F_FIXED_FILE));
 	if (req->fixed_rsrc_refs)
 		percpu_ref_put(req->fixed_rsrc_refs);
+	if (req->work.creds) {
+		put_cred(req->work.creds);
+		req->work.creds = NULL;
+	}
 
 	if (req->flags & REQ_F_INFLIGHT) {
 		struct io_ring_ctx *ctx = req->ctx;
@@ -5916,18 +5923,8 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 	const struct cred *creds = NULL;
 	int ret;
 
-	if (req->work.personality) {
-		const struct cred *new_creds;
-
-		if (!(issue_flags & IO_URING_F_NONBLOCK))
-			mutex_lock(&ctx->uring_lock);
-		new_creds = idr_find(&ctx->personality_idr, req->work.personality);
-		if (!(issue_flags & IO_URING_F_NONBLOCK))
-			mutex_unlock(&ctx->uring_lock);
-		if (!new_creds)
-			return -EINVAL;
-		creds = override_creds(new_creds);
-	}
+	if (req->work.creds && req->work.creds != current_cred())
+		creds = override_creds(req->work.creds);
 
 	switch (req->opcode) {
 	case IORING_OP_NOP:
@@ -6291,7 +6288,7 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 {
 	struct io_submit_state *state;
 	unsigned int sqe_flags;
-	int ret = 0;
+	int personality, ret = 0;
 
 	req->opcode = READ_ONCE(sqe->opcode);
 	/* same numerical values with corresponding REQ_F_*, safe to copy */
@@ -6324,8 +6321,16 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		return -EOPNOTSUPP;
 
 	req->work.list.next = NULL;
+	personality = READ_ONCE(sqe->personality);
+	if (personality) {
+		req->work.creds = idr_find(&ctx->personality_idr, personality);
+		if (!req->work.creds)
+			return -EINVAL;
+		get_cred(req->work.creds);
+	} else {
+		req->work.creds = NULL;
+	}
 	req->work.flags = 0;
-	req->work.personality = READ_ONCE(sqe->personality);
 	state = &ctx->submit_state;
 
 	/*

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] io_uring fixes for 5.12-rc2
  2021-03-06 16:25       ` Jens Axboe
@ 2021-03-06 21:14         ` Linus Torvalds
  2021-03-06 22:28           ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2021-03-06 21:14 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Sat, Mar 6, 2021 at 8:25 AM Jens Axboe <[email protected]> wrote:
>
> You're not, but creds is just one part of it. But I think we're OK
> saying "If you do unshare(CLONE_FILES) or CLONE_FS, then it's not
> going to impact async offload for your io_uring". IOW, you really
> should do that before setting up your ring(s).

Yeah, I think that one is solidly in a "don't do that then" thing.

Changing credentials is "normal use" - if you are a server process,
you may want to just make sure that you use different credentials for
different client requests etc.

But doing something like "I set up an io_uring for async work, and
then I dis-associated from my existing FS state entirely", that's not
normal. That's just a "you did something crazy, and you'll get crazy
results back, because the async part is fundamentally different from
the synchronous part".

It might be an option to just tear down the IO uring state entirely if
somebody does a "unshare(CLONE_FILES)", the same way unsharing the VM
(ie execve) does. But then it shouldn't even try to keep the state in
sync, it would just be "ok, your old io_uring is now simply gone".

I'm not sure it's even worth worrying about. Because at some point,
it's just "garbage in, garbage out".

               Linus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] io_uring fixes for 5.12-rc2
  2021-03-06 21:14         ` Linus Torvalds
@ 2021-03-06 22:28           ` Jens Axboe
  0 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2021-03-06 22:28 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: io-uring

On 3/6/21 2:14 PM, Linus Torvalds wrote:
> On Sat, Mar 6, 2021 at 8:25 AM Jens Axboe <[email protected]> wrote:
>>
>> You're not, but creds is just one part of it. But I think we're OK
>> saying "If you do unshare(CLONE_FILES) or CLONE_FS, then it's not
>> going to impact async offload for your io_uring". IOW, you really
>> should do that before setting up your ring(s).
> 
> Yeah, I think that one is solidly in a "don't do that then" thing.
> 
> Changing credentials is "normal use" - if you are a server process,
> you may want to just make sure that you use different credentials for
> different client requests etc.
> 
> But doing something like "I set up an io_uring for async work, and
> then I dis-associated from my existing FS state entirely", that's not
> normal. That's just a "you did something crazy, and you'll get crazy
> results back, because the async part is fundamentally different from
> the synchronous part".

I agree, just wanted to make sure I wasn't the odd one out here...

> It might be an option to just tear down the IO uring state entirely if
> somebody does a "unshare(CLONE_FILES)", the same way unsharing the VM
> (ie execve) does. But then it shouldn't even try to keep the state in
> sync, it would just be "ok, your old io_uring is now simply gone".
> 
> I'm not sure it's even worth worrying about. Because at some point,
> it's just "garbage in, garbage out".

I'd rather not worry about it. It's perfectly legit, and lots of uses
cases do this, to use a ring and never have any work that uses the
io-wq thread helpers. For those cases, there's never any concern
with having moved to a different fs or files. So I'd rather just
put that one in the category of "don't do weird stuff, or you'll
get unexpected results".

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-03-06 22:30 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-03-05 18:09 [GIT PULL] io_uring fixes for 5.12-rc2 Jens Axboe
2021-03-05 20:54 ` Linus Torvalds
2021-03-05 21:58   ` Jens Axboe
2021-03-05 23:03     ` Linus Torvalds
2021-03-06 16:25       ` Jens Axboe
2021-03-06 21:14         ` Linus Torvalds
2021-03-06 22:28           ` Jens Axboe
2021-03-05 21:58 ` pr-tracker-bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox