public inbox for [email protected]
* [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works with SQPOLL
@ 2020-09-08 17:48 Jens Axboe
  2020-09-08 20:58 ` Pavel Begunkov
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2020-09-08 17:48 UTC (permalink / raw)
  To: io-uring

Fd instantiating commands like IORING_OP_ACCEPT now work with SQPOLL, but
we fail to grab the file table if IOSQE_ASYNC is set. Ensure we assign
the ring fd/file appropriately so we can defer grabbing them.

Reported-by: Josef Grieb <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
 fs/io_uring.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 80913973337a..e21a7a9c6a59 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6757,7 +6757,7 @@ static enum sq_ret __io_sq_thread(struct io_ring_ctx *ctx,
 
 	mutex_lock(&ctx->uring_lock);
 	if (likely(!percpu_ref_is_dying(&ctx->refs)))
-		ret = io_submit_sqes(ctx, to_submit, NULL, -1);
+		ret = io_submit_sqes(ctx, to_submit, ctx->ring_file, ctx->ring_fd);
 	mutex_unlock(&ctx->uring_lock);
 
 	if (!io_sqring_full(ctx) && wq_has_sleeper(&ctx->sqo_sq_wait))
@@ -8966,6 +8966,11 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
 		goto err;
 	}
 
+	if (p->flags & IORING_SETUP_SQPOLL) {
+		ctx->ring_fd = fd;
+		ctx->ring_file = file;
+	}
+
 	ret = io_sq_offload_create(ctx, p);
 	if (ret)
 		goto err;
-- 
2.28.0

-- 
Jens Axboe



* Re: [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works with SQPOLL
  2020-09-08 17:48 [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works with SQPOLL Jens Axboe
@ 2020-09-08 20:58 ` Pavel Begunkov
  2020-09-08 21:22   ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Pavel Begunkov @ 2020-09-08 20:58 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 08/09/2020 20:48, Jens Axboe wrote:
> Fd instantiating commands like IORING_OP_ACCEPT now work with SQPOLL, but
> we fail to grab the file table if IOSQE_ASYNC is set. Ensure we assign
> the ring fd/file appropriately so we can defer grabbing them.

IIRC, for fcheck() in io_grab_files() to work it should be under fdget(),
which isn't the case with SQPOLL threads. Am I mistaken?

And it looks strange that the following snippet will effectively disable
such requests.

fd = dup(ring_fd)
close(ring_fd)
ring_fd = fd
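
In plain C, with ring_fd assumed to be the fd returned by
io_uring_setup(), that sequence is just:

	int fd = dup(ring_fd);	/* second fd for the same ring file */
	close(ring_fd);		/* the fd the kernel remembered is now stale */
	ring_fd = fd;		/* fcheck(ctx->ring_fd) no longer matches */

after which io_grab_files() fails its fcheck() and such requests are
refused, even though the ring itself is still perfectly valid.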

> 
> Reported-by: Josef Grieb <[email protected]>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
>  fs/io_uring.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 80913973337a..e21a7a9c6a59 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -6757,7 +6757,7 @@ static enum sq_ret __io_sq_thread(struct io_ring_ctx *ctx,
>  
>  	mutex_lock(&ctx->uring_lock);
>  	if (likely(!percpu_ref_is_dying(&ctx->refs)))
> -		ret = io_submit_sqes(ctx, to_submit, NULL, -1);
> +		ret = io_submit_sqes(ctx, to_submit, ctx->ring_file, ctx->ring_fd);
>  	mutex_unlock(&ctx->uring_lock);
>  
>  	if (!io_sqring_full(ctx) && wq_has_sleeper(&ctx->sqo_sq_wait))
> @@ -8966,6 +8966,11 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
>  		goto err;
>  	}
>  
> +	if (p->flags & IORING_SETUP_SQPOLL) {
> +		ctx->ring_fd = fd;
> +		ctx->ring_file = file;
> +	}
> +
>  	ret = io_sq_offload_create(ctx, p);
>  	if (ret)
>  		goto err;
> 

-- 
Pavel Begunkov


* Re: [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works with SQPOLL
  2020-09-08 20:58 ` Pavel Begunkov
@ 2020-09-08 21:22   ` Jens Axboe
  2020-09-08 22:54     ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2020-09-08 21:22 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 9/8/20 2:58 PM, Pavel Begunkov wrote:
> On 08/09/2020 20:48, Jens Axboe wrote:
>> Fd instantiating commands like IORING_OP_ACCEPT now work with SQPOLL, but
>> we fail to grab the file table if IOSQE_ASYNC is set. Ensure we assign
>> the ring fd/file appropriately so we can defer grabbing them.
> 
> IIRC, for fcheck() in io_grab_files() to work it should be under fdget(),
> which isn't the case with SQPOLL threads. Am I mistaken?
> 
> And it looks strange that the following snippet will effectively disable
> such requests.
> 
> fd = dup(ring_fd)
> close(ring_fd)
> ring_fd = fd

Not disagreeing with that, I think my initial posting made it clear
it was a hack. Just piled it in there for easier testing in terms
of functionality.

But the next question is how to do this right...

-- 
Jens Axboe



* Re: [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works with SQPOLL
  2020-09-08 21:22   ` Jens Axboe
@ 2020-09-08 22:54     ` Jens Axboe
  2020-09-09  0:48       ` Josef
  2020-09-09  7:09       ` Pavel Begunkov
  0 siblings, 2 replies; 18+ messages in thread
From: Jens Axboe @ 2020-09-08 22:54 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 9/8/20 3:22 PM, Jens Axboe wrote:
> On 9/8/20 2:58 PM, Pavel Begunkov wrote:
>> On 08/09/2020 20:48, Jens Axboe wrote:
>>> Fd instantiating commands like IORING_OP_ACCEPT now work with SQPOLL, but
>>> we fail to grab the file table if IOSQE_ASYNC is set. Ensure we assign
>>> the ring fd/file appropriately so we can defer grabbing them.
>>
>> IIRC, for fcheck() in io_grab_files() to work it should be under fdget(),
>> which isn't the case with SQPOLL threads. Am I mistaken?
>>
>> And it looks strange that the following snippet will effectively disable
>> such requests.
>>
>> fd = dup(ring_fd)
>> close(ring_fd)
>> ring_fd = fd
> 
> Not disagreeing with that, I think my initial posting made it clear
> it was a hack. Just piled it in there for easier testing in terms
> of functionality.
> 
> But the next question is how to do this right...

Looking at this a bit more, I don't necessarily think there's a
better option. If you dup+close, then it just won't work. We have no
way of knowing if the 'fd' changed, but we can detect if it was closed
and then we'll end up just EBADF'ing the requests.

So right now the answer is that we can support this just fine with
SQPOLL, but you better not dup and close the original fd. Which is not
ideal, but better than NOT being able to support it.

Only other option I see is to provide an io_uring_register()
command to update the fd/file associated with it. Which may be useful,
as it allows a process to indeed do this, if it absolutely has to.
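
Roughly like the below - IORING_REGISTER_RING_FD is a made-up opcode
name here, just sketching the proposed interface:

	/* hypothetical: tell the kernel this fd now refers to the ring */
	syscall(__NR_io_uring_register, new_ring_fd,
		IORING_REGISTER_RING_FD, NULL, 0);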

-- 
Jens Axboe



* Re: [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works with SQPOLL
  2020-09-08 22:54     ` Jens Axboe
@ 2020-09-09  0:48       ` Josef
  2020-09-09  7:09       ` Pavel Begunkov
  1 sibling, 0 replies; 18+ messages in thread
From: Josef @ 2020-09-09  0:48 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, io-uring; +Cc: norman

BTW I can confirm this patch works for me, thanks a lot :)

---
Josef

On Wed, 9 Sep 2020 at 00:55, Jens Axboe <[email protected]> wrote:
>
> On 9/8/20 3:22 PM, Jens Axboe wrote:
> > On 9/8/20 2:58 PM, Pavel Begunkov wrote:
> >> On 08/09/2020 20:48, Jens Axboe wrote:
> >>> Fd instantiating commands like IORING_OP_ACCEPT now work with SQPOLL, but
> >>> we fail to grab the file table if IOSQE_ASYNC is set. Ensure we assign
> >>> the ring fd/file appropriately so we can defer grabbing them.
> >>
> >> IIRC, for fcheck() in io_grab_files() to work it should be under fdget(),
> >> which isn't the case with SQPOLL threads. Am I mistaken?
> >>
> >> And it looks strange that the following snippet will effectively disable
> >> such requests.
> >>
> >> fd = dup(ring_fd)
> >> close(ring_fd)
> >> ring_fd = fd
> >
> > Not disagreeing with that, I think my initial posting made it clear
> > it was a hack. Just piled it in there for easier testing in terms
> > of functionality.
> >
> > But the next question is how to do this right...
>
> Looking at this a bit more, I don't necessarily think there's a
> better option. If you dup+close, then it just won't work. We have no
> way of knowing if the 'fd' changed, but we can detect if it was closed
> and then we'll end up just EBADF'ing the requests.
>
> So right now the answer is that we can support this just fine with
> SQPOLL, but you better not dup and close the original fd. Which is not
> ideal, but better than NOT being able to support it.
>
> Only other option I see is to provide an io_uring_register()
> command to update the fd/file associated with it. Which may be useful,
> as it allows a process to indeed do this, if it absolutely has to.
>
> --
> Jens Axboe
>


* Re: [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works with SQPOLL
  2020-09-08 22:54     ` Jens Axboe
  2020-09-09  0:48       ` Josef
@ 2020-09-09  7:09       ` Pavel Begunkov
  2020-09-09 13:10         ` Jens Axboe
  1 sibling, 1 reply; 18+ messages in thread
From: Pavel Begunkov @ 2020-09-09  7:09 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 09/09/2020 01:54, Jens Axboe wrote:
> On 9/8/20 3:22 PM, Jens Axboe wrote:
>> On 9/8/20 2:58 PM, Pavel Begunkov wrote:
>>> On 08/09/2020 20:48, Jens Axboe wrote:
>>>> Fd instantiating commands like IORING_OP_ACCEPT now work with SQPOLL, but
>>>> we fail to grab the file table if IOSQE_ASYNC is set. Ensure we assign
>>>> the ring fd/file appropriately so we can defer grabbing them.
>>>
>>> IIRC, for fcheck() in io_grab_files() to work it should be under fdget(),
>>> which isn't the case with SQPOLL threads. Am I mistaken?
>>>
>>> And it looks strange that the following snippet will effectively disable
>>> such requests.
>>>
>>> fd = dup(ring_fd)
>>> close(ring_fd)
>>> ring_fd = fd
>>
>> Not disagreeing with that, I think my initial posting made it clear
>> it was a hack. Just piled it in there for easier testing in terms
>> of functionality.
>>
>> But the next question is how to do this right...
>
> Looking at this a bit more, I don't necessarily think there's a
> better option. If you dup+close, then it just won't work. We have no
> way of knowing if the 'fd' changed, but we can detect if it was closed
> and then we'll end up just EBADF'ing the requests.
> 
> So right now the answer is that we can support this just fine with
> SQPOLL, but you better not dup and close the original fd. Which is not
> ideal, but better than NOT being able to support it.
> 
> Only other option I see is to provide an io_uring_register()
> command to update the fd/file associated with it. Which may be useful,
> as it allows a process to indeed do this, if it absolutely has to.

Let's put aside such dirty hacks, at least until someone actually
needs it. Ideally, for many reasons I'd prefer to get rid of
fcheck(ctx->ring_fd) in favour of synchronisation in
io_grab_files(), but I wish I knew how.

-- 
Pavel Begunkov


* Re: [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works with SQPOLL
  2020-09-09  7:09       ` Pavel Begunkov
@ 2020-09-09 13:10         ` Jens Axboe
  2020-09-09 13:53           ` Jens Axboe
  2020-09-09 15:48           ` Pavel Begunkov
  0 siblings, 2 replies; 18+ messages in thread
From: Jens Axboe @ 2020-09-09 13:10 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 9/9/20 1:09 AM, Pavel Begunkov wrote:
> On 09/09/2020 01:54, Jens Axboe wrote:
>> On 9/8/20 3:22 PM, Jens Axboe wrote:
>>> On 9/8/20 2:58 PM, Pavel Begunkov wrote:
>>>> On 08/09/2020 20:48, Jens Axboe wrote:
>>>>> Fd instantiating commands like IORING_OP_ACCEPT now work with SQPOLL, but
>>>>> we fail to grab the file table if IOSQE_ASYNC is set. Ensure we assign
>>>>> the ring fd/file appropriately so we can defer grabbing them.
>>>>
>>>> IIRC, for fcheck() in io_grab_files() to work it should be under fdget(),
>>>> which isn't the case with SQPOLL threads. Am I mistaken?
>>>>
>>>> And it looks strange that the following snippet will effectively disable
>>>> such requests.
>>>>
>>>> fd = dup(ring_fd)
>>>> close(ring_fd)
>>>> ring_fd = fd
>>>
>>> Not disagreeing with that, I think my initial posting made it clear
>>> it was a hack. Just piled it in there for easier testing in terms
>>> of functionality.
>>>
>>> But the next question is how to do this right...
>>
>> Looking at this a bit more, I don't necessarily think there's a
>> better option. If you dup+close, then it just won't work. We have no
>> way of knowing if the 'fd' changed, but we can detect if it was closed
>> and then we'll end up just EBADF'ing the requests.
>>
>> So right now the answer is that we can support this just fine with
>> SQPOLL, but you better not dup and close the original fd. Which is not
>> ideal, but better than NOT being able to support it.
>>
>> Only other option I see is to provide an io_uring_register()
>> command to update the fd/file associated with it. Which may be useful,
>> as it allows a process to indeed do this, if it absolutely has to.
> 
> Let's put aside such dirty hacks, at least until someone actually
> needs it. Ideally, for many reasons I'd prefer to get rid of

But it is actually needed, otherwise we're even more in a limbo state of
"SQPOLL works for most things now, just not all". And this isn't that
hard to make right - on the flush() side, we just need to park/stall the
thread and clear the ring_fd/ring_file, then mark the ring as needing a
queue enter. On the queue enter, we re-set the fd/file if they're NULL,
and unpark/unstall the thread. That's it. I'll write this up and test it.

> fcheck(ctx->ring_fd) in favour of synchronisation in io_grab_files(),
> but I wish I knew how.

That'd be nice, and apply equally to all cases as the SQPOLL case isn't
special at all anymore.

-- 
Jens Axboe



* Re: [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works with SQPOLL
  2020-09-09 13:10         ` Jens Axboe
@ 2020-09-09 13:53           ` Jens Axboe
  2020-09-09 15:48           ` Pavel Begunkov
  1 sibling, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2020-09-09 13:53 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 9/9/20 7:10 AM, Jens Axboe wrote:
> On 9/9/20 1:09 AM, Pavel Begunkov wrote:
>> On 09/09/2020 01:54, Jens Axboe wrote:
>>> On 9/8/20 3:22 PM, Jens Axboe wrote:
>>>> On 9/8/20 2:58 PM, Pavel Begunkov wrote:
>>>>> On 08/09/2020 20:48, Jens Axboe wrote:
>>>>>> Fd instantiating commands like IORING_OP_ACCEPT now work with SQPOLL, but
>>>>>> we fail to grab the file table if IOSQE_ASYNC is set. Ensure we assign
>>>>>> the ring fd/file appropriately so we can defer grabbing them.
>>>>>
>>>>> IIRC, for fcheck() in io_grab_files() to work it should be under fdget(),
>>>>> which isn't the case with SQPOLL threads. Am I mistaken?
>>>>>
>>>>> And it looks strange that the following snippet will effectively disable
>>>>> such requests.
>>>>>
>>>>> fd = dup(ring_fd)
>>>>> close(ring_fd)
>>>>> ring_fd = fd
>>>>
>>>> Not disagreeing with that, I think my initial posting made it clear
>>>> it was a hack. Just piled it in there for easier testing in terms
>>>> of functionality.
>>>>
>>>> But the next question is how to do this right...
>>>
>>> Looking at this a bit more, I don't necessarily think there's a
>>> better option. If you dup+close, then it just won't work. We have no
>>> way of knowing if the 'fd' changed, but we can detect if it was closed
>>> and then we'll end up just EBADF'ing the requests.
>>>
>>> So right now the answer is that we can support this just fine with
>>> SQPOLL, but you better not dup and close the original fd. Which is not
>>> ideal, but better than NOT being able to support it.
>>>
>>> Only other option I see is to provide an io_uring_register()
>>> command to update the fd/file associated with it. Which may be useful,
>>> as it allows a process to indeed do this, if it absolutely has to.
>>
>> Let's put aside such dirty hacks, at least until someone actually
>> needs it. Ideally, for many reasons I'd prefer to get rid of
> 
> But it is actually needed, otherwise we're even more in a limbo state of
> "SQPOLL works for most things now, just not all". And this isn't that
> hard to make right - on the flush() side, we just need to park/stall the
> thread and clear the ring_fd/ring_file, then mark the ring as needing a
> queue enter. On the queue enter, we re-set the fd/file if they're NULL,
> and unpark/unstall the thread. That's it. I'll write this up and test it.

Something like this - make sure the thread is idle if the ring fd is
closed, then clear our ring fd/file. Mark the ring as needing a kernel
enter, and when that enter happens, assign the current fd/file. Once
that's done, we can resume processing.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index bd1ac8581d38..652cc53432d4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6704,7 +6704,14 @@ static enum sq_ret __io_sq_thread(struct io_ring_ctx *ctx,
 		mutex_unlock(&ctx->uring_lock);
 	}
 
-	to_submit = io_sqring_entries(ctx);
+	/*
+	 * If ->ring_file is NULL, we're waiting on new fd/file assignment.
+	 * Don't submit anything new until that happens.
+	 */
+	if (ctx->ring_file)
+		to_submit = io_sqring_entries(ctx);
+	else
+		to_submit = 0;
 
 	/*
 	 * If submit got -EBUSY, flag us as needing the application
@@ -6748,7 +6755,7 @@ static enum sq_ret __io_sq_thread(struct io_ring_ctx *ctx,
 		}
 
 		to_submit = io_sqring_entries(ctx);
-		if (!to_submit || ret == -EBUSY)
+		if (!to_submit || ret == -EBUSY || !ctx->ring_file)
 			return SQT_IDLE;
 
 		finish_wait(&sqd->wait, &ctx->sqo_wait_entry);
@@ -8521,6 +8528,18 @@ static int io_uring_flush(struct file *file, void *data)
 	if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) {
 		io_sq_thread_stop(ctx);
 		io_wq_cancel_cb(ctx->io_wq, io_cancel_task_cb, current, true);
+	} else if (ctx->flags & IORING_SETUP_SQPOLL) {
+		struct io_sq_data *sqd = ctx->sq_data;
+
+		/* Ring is being closed, mark us as needing new assignment */
+		if (sqd->thread)
+			kthread_park(sqd->thread);
+		ctx->ring_fd = -1;
+		ctx->ring_file = NULL;
+		set_bit(1, &ctx->sq_check_overflow);
+		io_ring_set_wakeup_flag(ctx);
+		if (sqd->thread)
+			kthread_unpark(sqd->thread);
 	}
 
 	return 0;
@@ -8656,6 +8675,20 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
 		if (!list_empty_careful(&ctx->cq_overflow_list))
 			io_cqring_overflow_flush(ctx, false);
+		/* marked as needing new fd assignment, process it now */
+		if (test_bit(1, &ctx->sq_check_overflow) &&
+		    test_and_clear_bit(1, &ctx->sq_check_overflow)) {
+			struct io_sq_data *sqd = ctx->sq_data;
+
+			if (sqd->thread)
+				kthread_park(sqd->thread);
+
+			ctx->ring_fd = fd;
+			ctx->ring_file = f.file;
+
+			if (sqd->thread)
+				kthread_unpark(sqd->thread);
+		}
 		if (flags & IORING_ENTER_SQ_WAKEUP)
 			wake_up(&ctx->sq_data->wait);
 		if (flags & IORING_ENTER_SQ_WAIT)

-- 
Jens Axboe



* Re: [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works with SQPOLL
  2020-09-09 13:10         ` Jens Axboe
  2020-09-09 13:53           ` Jens Axboe
@ 2020-09-09 15:48           ` Pavel Begunkov
  2020-09-09 16:07             ` Jens Axboe
  1 sibling, 1 reply; 18+ messages in thread
From: Pavel Begunkov @ 2020-09-09 15:48 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 09/09/2020 16:10, Jens Axboe wrote:
> On 9/9/20 1:09 AM, Pavel Begunkov wrote:
>> On 09/09/2020 01:54, Jens Axboe wrote:
>>> On 9/8/20 3:22 PM, Jens Axboe wrote:
>>>> On 9/8/20 2:58 PM, Pavel Begunkov wrote:
>>>>> On 08/09/2020 20:48, Jens Axboe wrote:
>>>>>> Fd instantiating commands like IORING_OP_ACCEPT now work with SQPOLL, but
>>>>>> we fail to grab the file table if IOSQE_ASYNC is set. Ensure we assign
>>>>>> the ring fd/file appropriately so we can defer grabbing them.
>>>>>
>>>>> IIRC, for fcheck() in io_grab_files() to work it should be under fdget(),
>>>>> which isn't the case with SQPOLL threads. Am I mistaken?
>>>>>
>>>>> And it looks strange that the following snippet will effectively disable
>>>>> such requests.
>>>>>
>>>>> fd = dup(ring_fd)
>>>>> close(ring_fd)
>>>>> ring_fd = fd
>>>>
>>>> Not disagreeing with that, I think my initial posting made it clear
>>>> it was a hack. Just piled it in there for easier testing in terms
>>>> of functionality.
>>>>
>>>> But the next question is how to do this right...
>>>
>>> Looking at this a bit more, I don't necessarily think there's a
>>> better option. If you dup+close, then it just won't work. We have no
>>> way of knowing if the 'fd' changed, but we can detect if it was closed
>>> and then we'll end up just EBADF'ing the requests.
>>>
>>> So right now the answer is that we can support this just fine with
>>> SQPOLL, but you better not dup and close the original fd. Which is not
>>> ideal, but better than NOT being able to support it.
>>>
>>> Only other option I see is to provide an io_uring_register()
>>> command to update the fd/file associated with it. Which may be useful,
>>> as it allows a process to indeed do this, if it absolutely has to.
>>
>> Let's put aside such dirty hacks, at least until someone actually
>> needs it. Ideally, for many reasons I'd prefer to get rid of
> 
> But it is actually needed, otherwise we're even more in a limbo state of
> "SQPOLL works for most things now, just not all". And this isn't that
> hard to make right - on the flush() side, we just need to park/stall the

I understand that it isn't hard, but I just don't want to expose it to
the userspace: a) it's a userspace API, so it probably couldn't be killed
in the future, and b) it works around the kernel's problems, and so
shouldn't really be exposed to the userspace in normal circumstances.

And it's not generic enough because of a possible "many fds -> single
file" mapping, and there will be a lot of questions and problems.

e.g. if a process shares an io_uring with another process, then
dup()+close() would require not only this hook but also additional
inter-process synchronisation. And so on.


> thread and clear the ring_fd/ring_file, then mark the ring as needing a
> queue enter. On the queue enter, we re-set the fd/file if they're NULL,
> unpark/unstall the thread. That's it. I'll write this up and test it.
> 
>> fcheck(ctx->ring_fd) in favour of synchronisation in io_grab_files(),
>> but I wish I knew how.
> 
> That'd be nice, and apply equally to all cases as the SQPOLL case isn't
> special at all anymore.

I'm missing the whole story - have you asked the fs guys about the problem?
Or is it known that nothing would work?

-- 
Pavel Begunkov


* Re: [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works with SQPOLL
  2020-09-09 15:48           ` Pavel Begunkov
@ 2020-09-09 16:07             ` Jens Axboe
  2020-09-10 12:37               ` Pavel Begunkov
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2020-09-09 16:07 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 9/9/20 9:48 AM, Pavel Begunkov wrote:
> On 09/09/2020 16:10, Jens Axboe wrote:
>> On 9/9/20 1:09 AM, Pavel Begunkov wrote:
>>> On 09/09/2020 01:54, Jens Axboe wrote:
>>>> On 9/8/20 3:22 PM, Jens Axboe wrote:
>>>>> On 9/8/20 2:58 PM, Pavel Begunkov wrote:
>>>>>> On 08/09/2020 20:48, Jens Axboe wrote:
>>>>>>> Fd instantiating commands like IORING_OP_ACCEPT now work with SQPOLL, but
>>>>>>> we fail to grab the file table if IOSQE_ASYNC is set. Ensure we assign
>>>>>>> the ring fd/file appropriately so we can defer grabbing them.
>>>>>>
>>>>>> IIRC, for fcheck() in io_grab_files() to work it should be under fdget(),
>>>>>> which isn't the case with SQPOLL threads. Am I mistaken?
>>>>>>
>>>>>> And it looks strange that the following snippet will effectively disable
>>>>>> such requests.
>>>>>>
>>>>>> fd = dup(ring_fd)
>>>>>> close(ring_fd)
>>>>>> ring_fd = fd
>>>>>
>>>>> Not disagreeing with that, I think my initial posting made it clear
>>>>> it was a hack. Just piled it in there for easier testing in terms
>>>>> of functionality.
>>>>>
>>>>> But the next question is how to do this right...
>>>>
>>>> Looking at this a bit more, I don't necessarily think there's a
>>>> better option. If you dup+close, then it just won't work. We have no
>>>> way of knowing if the 'fd' changed, but we can detect if it was closed
>>>> and then we'll end up just EBADF'ing the requests.
>>>>
>>>> So right now the answer is that we can support this just fine with
>>>> SQPOLL, but you better not dup and close the original fd. Which is not
>>>> ideal, but better than NOT being able to support it.
>>>>
>>>> Only other option I see is to provide an io_uring_register()
>>>> command to update the fd/file associated with it. Which may be useful,
>>>> as it allows a process to indeed do this, if it absolutely has to.
>>>
>>> Let's put aside such dirty hacks, at least until someone actually
>>> needs it. Ideally, for many reasons I'd prefer to get rid of
>>
>> But it is actually needed, otherwise we're even more in a limbo state of
>> "SQPOLL works for most things now, just not all". And this isn't that
>> hard to make right - on the flush() side, we just need to park/stall the
> 
> I understand that it isn't hard, but I just don't want to expose it to
> the userspace: a) it's a userspace API, so it probably couldn't be killed
> in the future, and b) it works around the kernel's problems, and so
> shouldn't really be exposed to the userspace in normal circumstances.
> 
> And it's not generic enough because of a possible "many fds -> single
> file" mapping, and there will be a lot of questions and problems.
> 
> e.g. if a process shares an io_uring with another process, then
> dup()+close() would require not only this hook but also additional
> inter-process synchronisation. And so on.

I think you're blowing this out of proportion. Just to restate the
goal: it's to have SQPOLL be as useful as the other modes. One of
those things is making non-registered files work. For some use cases,
registered files is fine, for others it's basically a non-starter.

With that out of the way, the included patch handles the "close ring
fd case". You're talking about the dup or receive case, or anything
that doesn't close an existing ring. And yes, that won't work as-is,
because we know have multiple fds for that particular ring. That boils
the case down to "we're now using this fd for the ring", and the only
requirement here would be to ensure you do a io_uring_enter() if you
decide to swap fds or use a new fd. Only caveat here is that we can't
make it automatic like we can for the "old fd gets closed" case, so
the app would absolutely have to ensure it enters the kernel if it
uses a new fd.

Not really a huge deal imho in terms of API, especially since this
falls into the realm of "nobody probably ever does this, or if they do,
then this requirement isn't really a problem".
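
For the swap itself, the application-side obligation sketched above
would be no more than something like:

	int new_fd = dup(ring_fd);
	close(ring_fd);		/* ->flush() quiesces the SQPOLL thread */
	ring_fd = new_fd;

	/* any enter on the new fd lets the kernel re-learn the ring
	 * fd/file; a zero submit/wait is enough */
	syscall(__NR_io_uring_enter, ring_fd, 0, 0, 0, NULL, 0);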

>> thread and clear the ring_fd/ring_file, then mark the ring as needing a
>> queue enter. On the queue enter, we re-set the fd/file if they're NULL,
>> and unpark/unstall the thread. That's it. I'll write this up and test it.
>>
>>> fcheck(ctx->ring_fd) in favour of synchronisation in io_grab_files(),
>>> but I wish I knew how.
>>
>> That'd be nice, and apply equally to all cases as the SQPOLL case isn't
>> special at all anymore.
> 
> I'm missing the whole story - have you asked the fs guys about the problem?
> Or is it known that nothing would work?

I haven't looked into it.

-- 
Jens Axboe



* Re: [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works with SQPOLL
  2020-09-09 16:07             ` Jens Axboe
@ 2020-09-10 12:37               ` Pavel Begunkov
  2020-09-10 13:11                 ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Pavel Begunkov @ 2020-09-10 12:37 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 09/09/2020 19:07, Jens Axboe wrote:
> On 9/9/20 9:48 AM, Pavel Begunkov wrote:
>> On 09/09/2020 16:10, Jens Axboe wrote:
>>> On 9/9/20 1:09 AM, Pavel Begunkov wrote:
>>>> On 09/09/2020 01:54, Jens Axboe wrote:
>>>>> On 9/8/20 3:22 PM, Jens Axboe wrote:
>>>>>> On 9/8/20 2:58 PM, Pavel Begunkov wrote:
>>>>>>> On 08/09/2020 20:48, Jens Axboe wrote:
>>>>>>>> Fd instantiating commands like IORING_OP_ACCEPT now work with SQPOLL, but
>>>>>>>> we fail to grab the file table if IOSQE_ASYNC is set. Ensure we assign
>>>>>>>> the ring fd/file appropriately so we can defer grabbing them.
>>>>>>>
>>>>>>> IIRC, for fcheck() in io_grab_files() to work it should be under fdget(),
>>>>>>> which isn't the case with SQPOLL threads. Am I mistaken?
>>>>>>>
>>>>>>> And it looks strange that the following snippet will effectively disable
>>>>>>> such requests.
>>>>>>>
>>>>>>> fd = dup(ring_fd)
>>>>>>> close(ring_fd)
>>>>>>> ring_fd = fd
>>>>>>
>>>>>> Not disagreeing with that, I think my initial posting made it clear
>>>>>> it was a hack. Just piled it in there for easier testing in terms
>>>>>> of functionality.
>>>>>>
>>>>>> But the next question is how to do this right...
>>>>>
>>>>> Looking at this a bit more, I don't necessarily think there's a
>>>>> better option. If you dup+close, then it just won't work. We have no
>>>>> way of knowing if the 'fd' changed, but we can detect if it was closed
>>>>> and then we'll end up just EBADF'ing the requests.
>>>>>
>>>>> So right now the answer is that we can support this just fine with
>>>>> SQPOLL, but you better not dup and close the original fd. Which is not
>>>>> ideal, but better than NOT being able to support it.
>>>>>
>>>>> Only other option I see is to provide an io_uring_register()
>>>>> command to update the fd/file associated with it. Which may be useful,
>>>>> as it allows a process to indeed do this, if it absolutely has to.
>>>>
>>>> Let's put aside such dirty hacks, at least until someone actually
>>>> needs it. Ideally, for many reasons I'd prefer to get rid of
>>>
>>> But it is actually needed, otherwise we're even more in a limbo state of
>>> "SQPOLL works for most things now, just not all". And this isn't that
>>> hard to make right - on the flush() side, we just need to park/stall the
>>
>> I understand that it isn't hard, but I just don't want to expose it to
>> the userspace: a) it's a userspace API, so it probably couldn't be killed
>> in the future, and b) it works around the kernel's problems, and so
>> shouldn't really be exposed to the userspace in normal circumstances.
>>
>> And it's not generic enough because of a possible "many fds -> single
>> file" mapping, and there will be a lot of questions and problems.
>>
>> e.g. if a process shares an io_uring with another process, then
>> dup()+close() would require not only this hook but also additional
>> inter-process synchronisation. And so on.
> 
> I think you're blowing this out of proportion. Just to restate the

I just think that if there is a potentially cleaner solution without
involving userspace, we should try to look for it first, even if it
would take more time. That was the point.

> goal: it's to have SQPOLL be as useful as the other modes. One of
> those things is making non-registered files work. For some use cases,
> registered files is fine, for others it's basically a non-starter.
>
> With that out of the way, the included patch handles the "close ring
> fd case". You're talking about the dup or receive case, or anything
> that doesn't close an existing ring. And yes, that won't work as-is,
> because we now have multiple fds for that particular ring. That boils
> the case down to "we're now using this fd for the ring", and the only
> requirement here would be to ensure you do an io_uring_enter() if you
> decide to swap fds or use a new fd. Only caveat here is that we can't
> make it automatic like we can for the "old fd gets closed" case, so
> the app would absolutely have to ensure it enters the kernel if it
> uses a new fd.
> 
> Not really a huge deal imho in terms of API, especially since this
> falls into the realm of "nobody probably ever does this, or if they do,
> then this requirement isn't really a problem".
> 
>>> thread and clear the ring_fd/ring_file, then mark the ring as needing a
>>> queue enter. On the queue enter, we re-set the fd/file if they're NULL,
>>> and unpark/unstall the thread. That's it. I'll write this up and test it.
>>>
>>>> fcheck(ctx->ring_fd) in favour of synchronisation in io_grab_files(),
>>>> but I wish I knew how.
>>>
>>> That'd be nice, and apply equally to all cases as the SQPOLL case isn't
>>> special at all anymore.
>>
>> I'm missing the whole story - have you asked the fs guys about the problem?
>> Or is it known that nothing would work?
> 
> I haven't looked into it.

Any chance you have someone in mind who can take a look? I don't
think I have a chance to get to anyone in fs.

-- 
Pavel Begunkov


* Re: [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works with SQPOLL
  2020-09-10 12:37               ` Pavel Begunkov
@ 2020-09-10 13:11                 ` Jens Axboe
  2020-09-10 18:18                   ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2020-09-10 13:11 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 9/10/20 6:37 AM, Pavel Begunkov wrote:
> On 09/09/2020 19:07, Jens Axboe wrote:
>> On 9/9/20 9:48 AM, Pavel Begunkov wrote:
>>> On 09/09/2020 16:10, Jens Axboe wrote:
>>>> On 9/9/20 1:09 AM, Pavel Begunkov wrote:
>>>>> On 09/09/2020 01:54, Jens Axboe wrote:
>>>>>> On 9/8/20 3:22 PM, Jens Axboe wrote:
>>>>>>> On 9/8/20 2:58 PM, Pavel Begunkov wrote:
>>>>>>>> On 08/09/2020 20:48, Jens Axboe wrote:
>>>>>>>>> Fd instantiating commands like IORING_OP_ACCEPT now work with SQPOLL, but
>>>>>>>>> we fail to grab the file table if IOSQE_ASYNC is set. Ensure we assign
>>>>>>>>> the ring fd/file appropriately so we can defer grabbing them.
>>>>>>>>
>>>>>>>> IIRC, for fcheck() in io_grab_files() to work it should be under fdget(),
>>>>>>>> which isn't the case with SQPOLL threads. Am I mistaken?
>>>>>>>>
>>>>>>>> And it looks strange that the following snippet will effectively disable
>>>>>>>> such requests.
>>>>>>>>
>>>>>>>> fd = dup(ring_fd)
>>>>>>>> close(ring_fd)
>>>>>>>> ring_fd = fd
>>>>>>>
>>>>>>> Not disagreeing with that, I think my initial posting made it clear
>>>>>>> it was a hack. Just piled it in there for easier testing in terms
>>>>>>> of functionality.
>>>>>>>
>>>>>>> But the next question is how to do this right...
>>>>>>
>>>>>> Looking at this a bit more, I don't necessarily think there's a
>>>>>> better option. If you dup+close, then it just won't work. We have no
>>>>>> way of knowing if the 'fd' changed, but we can detect if it was closed
>>>>>> and then we'll end up just EBADF'ing the requests.
>>>>>>
>>>>>> So right now the answer is that we can support this just fine with
>>>>>> SQPOLL, but you better not dup and close the original fd. Which is not
>>>>>> ideal, but better than NOT being able to support it.
>>>>>>
>>>>>> Only other option I see is to provide an io_uring_register()
>>>>>> command to update the fd/file associated with it. Which may be useful,
>>>>>> as it allows a process to indeed do this, if it absolutely has to.
>>>>>
>>>>> Let's put aside such dirty hacks, at least until someone actually
>>>>> needs it. Ideally, for many reasons I'd prefer to get rid of
>>>>
>>>> But it is actually needed, otherwise we're even more in a limbo state of
>>>> "SQPOLL works for most things now, just not all". And this isn't that
>>>> hard to make right - on the flush() side, we just need to park/stall the
>>>
>>> I understand that it isn't hard, but I just don't want to expose it to
>>> the userspace: a) it's a userspace API, so it probably couldn't be killed
>>> in the future, and b) it works around the kernel's problems, and so
>>> shouldn't really be exposed to the userspace in normal circumstances.
>>>
>>> And it's not generic enough because of a possible "many fds -> single
>>> file" mapping, and there will be a lot of questions and problems.
>>>
>>> e.g. if a process shares an io_uring with another process, then
>>> dup()+close() would require not only this hook but also additional
>>> inter-process synchronisation. And so on.
>>
>> I think you're blowing this out of proportion. Just to restate the
> 
> I just think that if there is a potentially cleaner solution without
> involving userspace, we should try to look for it first, even if it
> would take more time. That was the point.

Regardless of whether or not we can eliminate that need, at least it'll
be a relaxing of the restriction, not an increase of it. It'll never
hurt to do an extra system call for the case where you're swapping fds.
I do get your point, I just don't think it's a big deal.

>>>>> fcheck(ctx->ring_fd) in favour of synchronisation in io_grab_files(),
>>>>> but I wish I knew how.
>>>>
>>>> That'd be nice, and apply equally to all cases as the SQPOLL case isn't
>>>> special at all anymore.
>>>
>>> I'm missing the whole story - have you asked the fs guys about the problem?
>>> Or is it known that nothing would work?
>>
>> I haven't looked into it.
> 
> Any chance you have someone in mind who can take a look? I don't
> think I have a chance to get to anyone in fs.

I'll take a look.

-- 
Jens Axboe



* Re: [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works with SQPOLL
  2020-09-10 13:11                 ` Jens Axboe
@ 2020-09-10 18:18                   ` Jens Axboe
  2020-09-10 21:01                     ` Jens Axboe
  2020-09-11 19:23                     ` Pavel Begunkov
  0 siblings, 2 replies; 18+ messages in thread
From: Jens Axboe @ 2020-09-10 18:18 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 9/10/20 7:11 AM, Jens Axboe wrote:
> On 9/10/20 6:37 AM, Pavel Begunkov wrote:
>> On 09/09/2020 19:07, Jens Axboe wrote:
>>> On 9/9/20 9:48 AM, Pavel Begunkov wrote:
>>>> On 09/09/2020 16:10, Jens Axboe wrote:
>>>>> On 9/9/20 1:09 AM, Pavel Begunkov wrote:
>>>>>> On 09/09/2020 01:54, Jens Axboe wrote:
>>>>>>> On 9/8/20 3:22 PM, Jens Axboe wrote:
>>>>>>>> On 9/8/20 2:58 PM, Pavel Begunkov wrote:
>>>>>>>>> On 08/09/2020 20:48, Jens Axboe wrote:
>>>>>>>>>> Fd instantiating commands like IORING_OP_ACCEPT now work with SQPOLL, but
>>>>>>>>>> we fail to grab the file table if IOSQE_ASYNC is set. Ensure we assign
>>>>>>>>>> the ring fd/file appropriately so we can defer grabbing them.
>>>>>>>>>
>>>>>>>>> IIRC, for fcheck() in io_grab_files() to work it should be under fdget(),
>>>>>>>>> which isn't the case with SQPOLL threads. Am I mistaken?
>>>>>>>>>
>>>>>>>>> And it looks strange that the following snippet will effectively disable
>>>>>>>>> such requests.
>>>>>>>>>
>>>>>>>>> fd = dup(ring_fd)
>>>>>>>>> close(ring_fd)
>>>>>>>>> ring_fd = fd
>>>>>>>>
>>>>>>>> Not disagreeing with that, I think my initial posting made it clear
>>>>>>>> it was a hack. Just piled it in there for easier testing in terms
>>>>>>>> of functionality.
>>>>>>>>
>>>>>>>> But the next question is how to do this right...
>>>>>>>
>>>>>>> Looking at this a bit more, I don't necessarily think there's a
>>>>>>> better option. If you dup+close, then it just won't work. We have no
>>>>>>> way of knowing if the 'fd' changed, but we can detect if it was closed
>>>>>>> and then we'll end up just EBADF'ing the requests.
>>>>>>>
>>>>>>> So right now the answer is that we can support this just fine with
>>>>>>> SQPOLL, but you better not dup and close the original fd. Which is not
>>>>>>> ideal, but better than NOT being able to support it.
>>>>>>>
>>>>>>> Only other option I see is to provide an io_uring_register()
>>>>>>> command to update the fd/file associated with it. Which may be useful,
>>>>>>> as it allows a process to indeed do this, if it absolutely has to.
>>>>>>
>>>>>> Let's put aside such dirty hacks, at least until someone actually
>>>>>> needs it. Ideally, for many reasons I'd prefer to get rid of
>>>>>
>>>>> But it is actually needed, otherwise we're even more in a limbo state of
>>>>> "SQPOLL works for most things now, just not all". And this isn't that
>>>>> hard to make right - on the flush() side, we just need to park/stall the
>>>>
>>>> I understand that it isn't hard, but I just don't want to expose it to
>>>> the userspace: a) it's a userspace API, so it probably couldn't be killed
>>>> in the future, and b) it works around the kernel's problems, and so
>>>> shouldn't really be exposed to the userspace in normal circumstances.
>>>>
>>>> And it's not generic enough because of a possible "many fds -> single
>>>> file" mapping, and there will be a lot of questions and problems.
>>>>
>>>> e.g. if a process shares an io_uring with another process, then
>>>> dup()+close() would require not only this hook but also additional
>>>> inter-process synchronisation. And so on.
>>>
>>> I think you're blowing this out of proportion. Just to restate the
>>
>> I just think that if there is a potentially cleaner solution without
>> involving userspace, we should try to look for it first, even if it
>> would take more time. That was the point.
> 
> Regardless of whether or not we can eliminate that need, at least it'll
> be a relaxing of the restriction, not an increase of it. It'll never
> hurt to do an extra system call for the case where you're swapping fds.
> I do get your point, I just don't think it's a big deal.

BTW, I don't see how we can ever get rid of a need to enter the kernel -
we'd need some chance at grabbing the updated ->files, for instance.
Might be possible to hold a reference to the task and grab it from
there, though it feels a bit iffy to hold a task reference from the ring
on the task that holds a reference to the ring. Haven't looked too
closely, but it should work, as this won't hold a file/files reference,
it's just a freeing reference.
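
A sketch of the shape that would take, with ctx->sqo_task as an assumed
new field:

	/* setup: pin the task itself - this is not a file/files
	 * reference, so it can't keep the ring alive via the file table */
	ctx->sqo_task = get_task_struct(current);

	/* SQPOLL thread: adopt the task's current ->files for submits */
	if (current->files != ctx->sqo_task->files) {
		task_lock(current);
		current->files = ctx->sqo_task->files;
		task_unlock(current);
	}

	/* ring teardown */
	put_task_struct(ctx->sqo_task);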

-- 
Jens Axboe



* Re: [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works with SQPOLL
  2020-09-10 18:18                   ` Jens Axboe
@ 2020-09-10 21:01                     ` Jens Axboe
  2020-09-10 22:11                       ` Jens Axboe
  2020-09-11 19:23                     ` Pavel Begunkov
  1 sibling, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2020-09-10 21:01 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Jann Horn

On 9/10/20 12:18 PM, Jens Axboe wrote:
> On 9/10/20 7:11 AM, Jens Axboe wrote:
>> On 9/10/20 6:37 AM, Pavel Begunkov wrote:
>>> On 09/09/2020 19:07, Jens Axboe wrote:
>>>> On 9/9/20 9:48 AM, Pavel Begunkov wrote:
>>>>> On 09/09/2020 16:10, Jens Axboe wrote:
>>>>>> On 9/9/20 1:09 AM, Pavel Begunkov wrote:
>>>>>>> On 09/09/2020 01:54, Jens Axboe wrote:
>>>>>>>> On 9/8/20 3:22 PM, Jens Axboe wrote:
>>>>>>>>> On 9/8/20 2:58 PM, Pavel Begunkov wrote:
>>>>>>>>>> On 08/09/2020 20:48, Jens Axboe wrote:
>>>>>>>>>>> Fd instantiating commands like IORING_OP_ACCEPT now work with SQPOLL, but
>>>>>>>>>>> we fail to grab the file table if IOSQE_ASYNC is set. Ensure we assign
>>>>>>>>>>> the ring fd/file appropriately so we can defer grabbing them.
>>>>>>>>>>
>>>>>>>>>> IIRC, for fcheck() in io_grab_files() to work it should be under fdget(),
>>>>>>>>>> which isn't the case with SQPOLL threads. Am I mistaken?
>>>>>>>>>>
>>>>>>>>>> And it looks strange that the following snippet will effectively disable
>>>>>>>>>> such requests.
>>>>>>>>>>
>>>>>>>>>> fd = dup(ring_fd)
>>>>>>>>>> close(ring_fd)
>>>>>>>>>> ring_fd = fd
>>>>>>>>>
>>>>>>>>> Not disagreeing with that, I think my initial posting made it clear
>>>>>>>>> it was a hack. Just piled it in there for easier testing in terms
>>>>>>>>> of functionality.
>>>>>>>>>
>>>>>>>>> But the next question is how to do this right...
>>>>>>>>
>>>>>>>> Looking at this a bit more, I don't necessarily think there's a
>>>>>>>> better option. If you dup+close, then it just won't work. We have no
>>>>>>>> way of knowing if the 'fd' changed, but we can detect if it was closed
>>>>>>>> and then we'll end up just EBADF'ing the requests.
>>>>>>>>
>>>>>>>> So right now the answer is that we can support this just fine with
>>>>>>>> SQPOLL, but you better not dup and close the original fd. Which is not
>>>>>>>> ideal, but better than NOT being able to support it.
>>>>>>>>
>>>>>>>> Only other option I see is to provide an io_uring_register()
>>>>>>>> command to update the fd/file associated with it. Which may be useful,
>>>>>>>> as it allows a process to indeed do this, if it absolutely has to.
>>>>>>>
>>>>>>> Let's put aside such dirty hacks, at least until someone actually
>>>>>>> needs it. Ideally, for many reasons I'd prefer to get rid of
>>>>>>
>>>>>> But it is actually needed, otherwise we're even more in a limbo state of
>>>>>> "SQPOLL works for most things now, just not all". And this isn't that
>>>>>> hard to make right - on the flush() side, we just need to park/stall the
>>>>>
>>>>> I understand that it isn't hard, but I just don't want to expose it to
>>>>> the userspace: a) it's a userspace API, so it probably couldn't be killed
>>>>> in the future, and b) it works around the kernel's problems, and so
>>>>> shouldn't really be exposed to the userspace in normal circumstances.
>>>>>
>>>>> And it's not generic enough because of a possible "many fds -> single
>>>>> file" mapping, and there will be a lot of questions and problems.
>>>>>
>>>>> e.g. if a process shares an io_uring with another process, then
>>>>> dup()+close() would require not only this hook but also additional
>>>>> inter-process synchronisation. And so on.
>>>>
>>>> I think you're blowing this out of proportion. Just to restate the
>>>
>>> I just think that if there is a potentially cleaner solution without
>>> involving userspace, we should try to look for it first, even if it
>>> would take more time. That was the point.
>>
>> Regardless of whether or not we can eliminate that need, at least it'll
>> be a relaxing of the restriction, not an increase of it. It'll never
>> hurt to do an extra system call for the case where you're swapping fds.
>> I do get your point, I just don't think it's a big deal.
> 
> BTW, I don't see how we can ever get rid of a need to enter the kernel -
> we'd need some chance at grabbing the updated ->files, for instance.
> Might be possible to hold a reference to the task and grab it from
> there, though it feels a bit iffy to hold a task reference from the ring
> on the task that holds a reference to the ring. Haven't looked too
> closely, but it should work, as this won't hold a file/files reference,
> it's just a freeing reference.

Sort of a half-assed attempt...

The idea is to assign a ->files sequence before we grab files, and then
compare it with the current one once we need to use the files. If they
mismatch, we fail the request with -ECANCELED.

For SQPOLL, don't grab ->files upfront, grab a reference to the task
instead. Use the task reference to assign files when we need it.

Adding Jann to help poke holes in this scheme. I'd be surprised if it's
solid as-is, but hopefully we can build on this idea and get rid of the
fcheck().

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 98cddcc03a16..0517ffa6cb11 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -290,11 +290,10 @@ struct io_ring_ctx {
 	struct io_wq		*io_wq;
 	struct mm_struct	*sqo_mm;
 	/*
-	 * For SQPOLL usage - no reference is held to this file table, we
-	 * rely on fops->flush() and our callback there waiting for the users
-	 * to finish.
+	 * For SQPOLL usage - we hold a reference to the parent task, so we
+	 * have access to the ->files
 	 */
-	struct files_struct	*sqo_files;
+	struct task_struct	*sqo_task;
 
 	struct wait_queue_entry	sqo_wait_entry;
 	struct list_head	sqd_list;
@@ -309,8 +308,11 @@ struct io_ring_ctx {
 	 */
 	struct fixed_file_data	*file_data;
 	unsigned		nr_user_files;
-	int 			ring_fd;
-	struct file 		*ring_file;
+
+	/* incremented when ->flush() is called */
+	atomic_t		files_seq;
+	/* assigned when ->files are grabbed */
+	int			cur_files_seq;
 
 	/* if used, fixed mapped user buffers */
 	unsigned		nr_user_bufs;
@@ -395,6 +397,7 @@ struct io_close {
 	struct file			*file;
 	struct file			*put_file;
 	int				fd;
+	int				files_seq;
 };
 
 struct io_timeout_data {
@@ -410,6 +413,7 @@ struct io_accept {
 	int __user			*addr_len;
 	int				flags;
 	unsigned long			nofile;
+	int				files_seq;
 };
 
 struct io_sync {
@@ -462,6 +466,7 @@ struct io_sr_msg {
 struct io_open {
 	struct file			*file;
 	int				dfd;
+	int				files_seq;
 	struct filename			*filename;
 	struct open_how			how;
 	unsigned long			nofile;
@@ -472,6 +477,7 @@ struct io_files_update {
 	u64				arg;
 	u32				nr_args;
 	u32				offset;
+	int				files_seq;
 };
 
 struct io_fadvise {
@@ -493,6 +499,7 @@ struct io_epoll {
 	int				epfd;
 	int				op;
 	int				fd;
+	int				files_seq;
 	struct epoll_event		event;
 };
 
@@ -519,6 +526,7 @@ struct io_statx {
 	int				dfd;
 	unsigned int			mask;
 	unsigned int			flags;
+	int				files_seq;
 	const char __user		*filename;
 	struct statx __user		*buffer;
 };
@@ -3861,6 +3869,20 @@ static int io_provide_buffers(struct io_kiocb *req, bool force_nonblock,
 	return 0;
 }
 
+static int io_check_files_seq(struct io_kiocb *req, int *seq)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	if (!req->work.files) {
+		*seq = atomic_read(&ctx->files_seq);
+		return 0;
+	} else if (*seq == ctx->cur_files_seq) {
+		return 0;
+	}
+
+	return -ECANCELED;
+}
+
 static int io_epoll_ctl_prep(struct io_kiocb *req,
 			     const struct io_uring_sqe *sqe)
 {
@@ -3882,6 +3904,7 @@ static int io_epoll_ctl_prep(struct io_kiocb *req,
 			return -EFAULT;
 	}
 
+	req->epoll.files_seq = req->ctx->cur_files_seq;
 	return 0;
 #else
 	return -EOPNOTSUPP;
@@ -3895,10 +3918,15 @@ static int io_epoll_ctl(struct io_kiocb *req, bool force_nonblock,
 	struct io_epoll *ie = &req->epoll;
 	int ret;
 
+	ret = io_check_files_seq(req, &ie->files_seq);
+	if (ret)
+		goto done;
+
 	ret = do_epoll_ctl(ie->epfd, ie->op, ie->fd, &ie->event, force_nonblock);
 	if (force_nonblock && ret == -EAGAIN)
 		return -EAGAIN;
 
+done:
 	if (ret < 0)
 		req_set_fail_links(req);
 	__io_req_complete(req, ret, 0, cs);
@@ -3994,6 +4022,7 @@ static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	req->statx.filename = u64_to_user_ptr(READ_ONCE(sqe->addr));
 	req->statx.buffer = u64_to_user_ptr(READ_ONCE(sqe->addr2));
 	req->statx.flags = READ_ONCE(sqe->statx_flags);
+	req->statx.files_seq = req->ctx->cur_files_seq;
 
 	return 0;
 }
@@ -4003,6 +4032,10 @@ static int io_statx(struct io_kiocb *req, bool force_nonblock)
 	struct io_statx *ctx = &req->statx;
 	int ret;
 
+	ret = io_check_files_seq(req, &ctx->files_seq);
+	if (ret)
+		goto done;
+
 	if (force_nonblock) {
 		/* only need file table for an actual valid fd */
 		if (ctx->dfd == -1 || ctx->dfd == AT_FDCWD)
@@ -4013,6 +4046,7 @@ static int io_statx(struct io_kiocb *req, bool force_nonblock)
 	ret = do_statx(ctx->dfd, ctx->filename, ctx->flags, ctx->mask,
 		       ctx->buffer);
 
+done:
 	if (ret < 0)
 		req_set_fail_links(req);
 	io_req_complete(req, ret);
@@ -4038,11 +4072,11 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return -EBADF;
 
 	req->close.fd = READ_ONCE(sqe->fd);
-	if ((req->file && req->file->f_op == &io_uring_fops) ||
-	    req->close.fd == req->ctx->ring_fd)
+	if (req->file && req->file->f_op == &io_uring_fops)
 		return -EBADF;
 
 	req->close.put_file = NULL;
+	req->close.files_seq = req->ctx->cur_files_seq;
 	return 0;
 }
 
@@ -4052,6 +4086,10 @@ static int io_close(struct io_kiocb *req, bool force_nonblock,
 	struct io_close *close = &req->close;
 	int ret;
 
+	ret = io_check_files_seq(req, &close->files_seq);
+	if (ret)
+		goto done;
+
 	/* might be already done during nonblock submission */
 	if (!close->put_file) {
 		ret = __close_fd_get_file(close->fd, &close->put_file);
@@ -4070,10 +4108,11 @@ static int io_close(struct io_kiocb *req, bool force_nonblock,
 
 	/* No ->flush() or already async, safely close from here */
 	ret = filp_close(close->put_file, req->work.files);
-	if (ret < 0)
-		req_set_fail_links(req);
 	fput(close->put_file);
 	close->put_file = NULL;
+done:
+	if (ret < 0)
+		req_set_fail_links(req);
 	__io_req_complete(req, ret, 0, cs);
 	return 0;
 }
@@ -4527,6 +4566,7 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	accept->addr_len = u64_to_user_ptr(READ_ONCE(sqe->addr2));
 	accept->flags = READ_ONCE(sqe->accept_flags);
 	accept->nofile = rlimit(RLIMIT_NOFILE);
+	accept->files_seq = req->ctx->cur_files_seq;
 	return 0;
 }
 
@@ -4537,6 +4577,10 @@ static int io_accept(struct io_kiocb *req, bool force_nonblock,
 	unsigned int file_flags = force_nonblock ? O_NONBLOCK : 0;
 	int ret;
 
+	ret = io_check_files_seq(req, &accept->files_seq);
+	if (ret)
+		goto done;
+
 	if (req->file->f_flags & O_NONBLOCK)
 		req->flags |= REQ_F_NOWAIT;
 
@@ -4545,6 +4589,7 @@ static int io_accept(struct io_kiocb *req, bool force_nonblock,
 					accept->nofile);
 	if (ret == -EAGAIN && force_nonblock)
 		return -EAGAIN;
+done:
 	if (ret < 0) {
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
@@ -5514,6 +5559,7 @@ static int io_files_update_prep(struct io_kiocb *req,
 	if (!req->files_update.nr_args)
 		return -EINVAL;
 	req->files_update.arg = READ_ONCE(sqe->addr);
+	req->files_update.files_seq = req->ctx->cur_files_seq;
 	return 0;
 }
 
@@ -5524,6 +5570,10 @@ static int io_files_update(struct io_kiocb *req, bool force_nonblock,
 	struct io_uring_files_update up;
 	int ret;
 
+	ret = io_check_files_seq(req, &req->files_update.files_seq);
+	if (ret)
+		goto done;
+
 	if (force_nonblock)
 		return -EAGAIN;
 
@@ -5533,7 +5583,7 @@ static int io_files_update(struct io_kiocb *req, bool force_nonblock,
 	mutex_lock(&ctx->uring_lock);
 	ret = __io_sqe_files_update(ctx, &up, req->files_update.nr_args);
 	mutex_unlock(&ctx->uring_lock);
-
+done:
 	if (ret < 0)
 		req_set_fail_links(req);
 	__io_req_complete(req, ret, 0, cs);
@@ -6119,34 +6169,21 @@ static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req,
 
 static int io_grab_files(struct io_kiocb *req)
 {
-	int ret = -EBADF;
 	struct io_ring_ctx *ctx = req->ctx;
 
 	io_req_init_async(req);
 
 	if (req->work.files || (req->flags & REQ_F_NO_FILE_TABLE))
 		return 0;
-	if (!ctx->ring_file)
-		return -EBADF;
 
-	rcu_read_lock();
 	spin_lock_irq(&ctx->inflight_lock);
-	/*
-	 * We use the f_ops->flush() handler to ensure that we can flush
-	 * out work accessing these files if the fd is closed. Check if
-	 * the fd has changed since we started down this path, and disallow
-	 * this operation if it has.
-	 */
-	if (fcheck(ctx->ring_fd) == ctx->ring_file) {
-		list_add(&req->inflight_entry, &ctx->inflight_list);
-		req->flags |= REQ_F_INFLIGHT;
-		req->work.files = current->files;
-		ret = 0;
-	}
+	list_add(&req->inflight_entry, &ctx->inflight_list);
+	req->flags |= REQ_F_INFLIGHT;
+	ctx->cur_files_seq = atomic_read(&ctx->files_seq);
+	req->work.files = current->files;
 	spin_unlock_irq(&ctx->inflight_lock);
-	rcu_read_unlock();
 
-	return ret;
+	return 0;
 }
 
 static inline int io_prep_work_files(struct io_kiocb *req)
@@ -6706,14 +6743,7 @@ static enum sq_ret __io_sq_thread(struct io_ring_ctx *ctx,
 		mutex_unlock(&ctx->uring_lock);
 	}
 
-	/*
-	 * If ->ring_file is NULL, we're waiting on new fd/file assignment.
-	 * Don't submit anything new until that happens.
-	 */
-	if (ctx->ring_file)
-		to_submit = io_sqring_entries(ctx);
-	else
-		to_submit = 0;
+	to_submit = io_sqring_entries(ctx);
 
 	/*
 	 * If submit got -EBUSY, flag us as needing the application
@@ -6757,7 +6787,7 @@ static enum sq_ret __io_sq_thread(struct io_ring_ctx *ctx,
 		}
 
 		to_submit = io_sqring_entries(ctx);
-		if (!to_submit || ret == -EBUSY || !ctx->ring_file)
+		if (!to_submit || ret == -EBUSY)
 			return SQT_IDLE;
 
 		finish_wait(&sqd->wait, &ctx->sqo_wait_entry);
@@ -6824,9 +6854,9 @@ static int io_sq_thread(void *data)
 				old_cred = override_creds(ctx->creds);
 			}
 
-			if (current->files != ctx->sqo_files) {
+			if (current->files != ctx->sqo_task->files) {
 				task_lock(current);
-				current->files = ctx->sqo_files;
+				current->files = ctx->sqo_task->files;
 				task_unlock(current);
 			}
 
@@ -7148,6 +7178,11 @@ static void io_finish_async(struct io_ring_ctx *ctx)
 		io_wq_destroy(ctx->io_wq);
 		ctx->io_wq = NULL;
 	}
+
+	if (ctx->sqo_task) {
+		put_task_struct(ctx->sqo_task);
+		ctx->sqo_task = NULL;
+	}
 }
 
 #if defined(CONFIG_UNIX)
@@ -7794,11 +7829,11 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 		mutex_unlock(&sqd->ctx_lock);
 
 		/*
-		 * We will exit the sqthread before current exits, so we can
-		 * avoid taking a reference here and introducing weird
-		 * circular dependencies on the files table.
+		 * Grab task reference for SQPOLL usage. This doesn't
+		 * introduce a circular reference, as the task reference is
+		 * just to ensure that the struct itself stays valid.
 		 */
-		ctx->sqo_files = current->files;
+		ctx->sqo_task = get_task_struct(current);
 
 		ctx->sq_thread_idle = msecs_to_jiffies(p->sq_thread_idle);
 		if (!ctx->sq_thread_idle)
@@ -7840,7 +7875,10 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 
 	return 0;
 err:
-	ctx->sqo_files = NULL;
+	if (ctx->sqo_task) {
+		put_task_struct(ctx->sqo_task);
+		ctx->sqo_task = NULL;
+	}
 	io_finish_async(ctx);
 	return ret;
 }
@@ -8538,6 +8576,9 @@ static int io_uring_flush(struct file *file, void *data)
 {
 	struct io_ring_ctx *ctx = file->private_data;
 
+	/* assume current files sequence is no longer valid */
+	atomic_inc(&ctx->files_seq);
+
 	io_uring_cancel_files(ctx, data);
 
 	/*
@@ -8549,14 +8590,8 @@ static int io_uring_flush(struct file *file, void *data)
 	} else if (ctx->flags & IORING_SETUP_SQPOLL) {
 		struct io_sq_data *sqd = ctx->sq_data;
 
-		/* Ring is being closed, mark us as neding new assignment */
+		/* quiesce sqpoll thread */
 		io_sq_thread_park(sqd);
-		mutex_lock(&ctx->uring_lock);
-		ctx->ring_fd = -1;
-		ctx->ring_file = NULL;
-		ctx->sqo_files = NULL;
-		mutex_unlock(&ctx->uring_lock);
-		io_ring_set_wakeup_flag(ctx);
 		io_sq_thread_unpark(sqd);
 	}
 
@@ -8693,19 +8728,6 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
 		if (!list_empty_careful(&ctx->cq_overflow_list))
 			io_cqring_overflow_flush(ctx, false);
-		if (fd != ctx->ring_fd) {
-			struct io_sq_data *sqd = ctx->sq_data;
-
-			io_sq_thread_park(sqd);
-
-			mutex_lock(&ctx->uring_lock);
-			ctx->ring_fd = fd;
-			ctx->ring_file = f.file;
-			ctx->sqo_files = current->files;
-			mutex_unlock(&ctx->uring_lock);
-
-			io_sq_thread_unpark(sqd);
-		}
 		if (flags & IORING_ENTER_SQ_WAKEUP)
 			wake_up(&ctx->sq_data->wait);
 		if (flags & IORING_ENTER_SQ_WAIT)
@@ -8713,8 +8735,6 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		submitted = to_submit;
 	} else if (to_submit) {
 		mutex_lock(&ctx->uring_lock);
-		ctx->ring_fd = fd;
-		ctx->ring_file = f.file;
 		submitted = io_submit_sqes(ctx, to_submit);
 		mutex_unlock(&ctx->uring_lock);
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works, with SQPOLL
  2020-09-10 21:01                     ` Jens Axboe
@ 2020-09-10 22:11                       ` Jens Axboe
  2020-09-10 23:04                         ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2020-09-10 22:11 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Jann Horn

[-- Attachment #1: Type: text/plain, Size: 4945 bytes --]

On 9/10/20 3:01 PM, Jens Axboe wrote:
> On 9/10/20 12:18 PM, Jens Axboe wrote:
>> On 9/10/20 7:11 AM, Jens Axboe wrote:
>>> On 9/10/20 6:37 AM, Pavel Begunkov wrote:
>>>> On 09/09/2020 19:07, Jens Axboe wrote:
>>>>> On 9/9/20 9:48 AM, Pavel Begunkov wrote:
>>>>>> On 09/09/2020 16:10, Jens Axboe wrote:
>>>>>>> On 9/9/20 1:09 AM, Pavel Begunkov wrote:
>>>>>>>> On 09/09/2020 01:54, Jens Axboe wrote:
>>>>>>>>> On 9/8/20 3:22 PM, Jens Axboe wrote:
>>>>>>>>>> On 9/8/20 2:58 PM, Pavel Begunkov wrote:
>>>>>>>>>>> On 08/09/2020 20:48, Jens Axboe wrote:
>>>>>>>>>>>> Fd instantiating commands like IORING_OP_ACCEPT now work with SQPOLL, but
>>>>>>>>>>>> we have an error in grabbing that if IOSQE_ASYNC is set. Ensure we assign
>>>>>>>>>>>> the ring fd/file appropriately so we can defer grab them.
>>>>>>>>>>>
>>>>>>>>>>> IIRC, for fcheck() in io_grab_files() to work it should be under fdget(),
>>>>>>>>>>> that isn't the case with SQPOLL threads. Am I mistaken?
>>>>>>>>>>>
>>>>>>>>>>> And it looks strange that the following snippet will effectively disable
>>>>>>>>>>> such requests.
>>>>>>>>>>>
>>>>>>>>>>> fd = dup(ring_fd)
>>>>>>>>>>> close(ring_fd)
>>>>>>>>>>> ring_fd = fd
>>>>>>>>>>
>>>>>>>>>> Not disagreeing with that, I think my initial posting made it clear
>>>>>>>>>> it was a hack. Just piled it in there for easier testing in terms
>>>>>>>>>> of functionality.
>>>>>>>>>>
>>>>>>>>>> But the next question is how to do this right...
>>>>>>>>> Looking at this a bit more, and I don't necessarily think there's a
>>>>>>>>> better option. If you dup+close, then it just won't work. We have no
>>>>>>>>> way of knowing if the 'fd' changed, but we can detect if it was closed
>>>>>>>>> and then we'll end up just EBADF'ing the requests.
>>>>>>>>>
>>>>>>>>> So right now the answer is that we can support this just fine with
>>>>>>>>> SQPOLL, but you better not dup and close the original fd. Which is not
>>>>>>>>> ideal, but better than NOT being able to support it.
>>>>>>>>>
>>>>>>>>> Only other option I see is to provide an io_uring_register()
>>>>>>>>> command to update the fd/file associated with it. Which may be useful,
>>>>>>>>> as it allows a process to indeed do this, if it absolutely has to.
>>>>>>>>
>>>>>>>> Let's put aside such dirty hacks, at least until someone actually
>>>>>>>> needs it. Ideally, for many reasons I'd prefer to get rid of
>>>>>>>
>>>>>>> But it is actually needed, otherwise we're even more in a limbo state of
>>>>>>> "SQPOLL works for most things now, just not all". And this isn't that
>>>>>>> hard to make right - on the flush() side, we just need to park/stall the
>>>>>>
>>>>>> I understand that it isn't hard, but I just don't want to expose it to
>>>>>> the userspace, a) because it's a userspace API, so couldn't probably be
>>>>>> killed in the future, b) works around kernel's problems, and so
>>>>>> shouldn't really be exposed to the userspace in normal circumstances.
>>>>>>
>>>>>> And it's not generic enough because of a possible "many fds -> single
>>>>>> file" mapping, and there will be a lot of questions and problems.
>>>>>>
>>>>>> e.g. if a process shares a io_uring with another process, then
>>>>>> dup()+close() would require not only this hook but also additional
>>>>>> inter-process synchronisation. And so on.
>>>>>
>>>>> I think you're blowing this out of proportion. Just to restate the
>>>>
>>>> I just think that if there is a potentially cleaner solution without
>>>> involving userspace, we should try to look for it first, even if it
>>>> would take more time. That was the point.
>>>
>>> Regardless of whether or not we can eliminate that need, at least it'll
>>> be a relaxing of the restriction, not an increase of it. It'll never
>>> hurt to do an extra system call for the case where you're swapping fds.
>>> I do get your point, I just don't think it's a big deal.
>>
>> BTW, I don't see how we can ever get rid of a need to enter the kernel,
>> we'd need some chance at grabbing the updated ->files, for instance.
>> Might be possible to hold a reference to the task and grab it from
>> there, though feels a bit iffy to hold a task reference from the ring on
>> the task that holds a reference to the ring. Haven't looked too close,
>> should work though as this won't hold a file/files reference, it's just
>> a freeing reference.
> 
> Sort of half assed attempt...
> 
> Idea is to assign a ->files sequence before we grab files, and then
> compare with the current one once we need to use the files. If they
> mismatch, we -ECANCELED the request.
> 
> For SQPOLL, don't grab ->files upfront, grab a reference to the task
> instead. Use the task reference to assign files when we need it.
> 
> Adding Jann to help poke holes in this scheme. I'd be surprised if it's
> solid as-is, but hopefully we can build on this idea and get rid of the
> fcheck().

Split it into two, to make it easier to reason about. Added a few
comments, etc.
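
To spell out the scheme in one place, here's a boiled-down sketch with
made-up names -- the real helper in the attached patch is
io_check_files_seq(), and it fails with -EBADF rather than -ECANCELED:

#include <linux/atomic.h>
#include <linux/errno.h>
#include <linux/fdtable.h>
#include <linux/sched.h>

/* hypothetical stand-ins for the ctx/req fields the patch adds */
struct ring { atomic_t files_seq; };
struct req  { int files_seq; struct files_struct *files; };

/* ->flush() side: invalidate anything grabbed before this point */
static void ring_flush(struct ring *ring)
{
	atomic_inc(&ring->files_seq);
}

/* grab side: snapshot the sequence along with ->files */
static void grab_files(struct ring *ring, struct req *req)
{
	req->files_seq = atomic_read(&ring->files_seq);
	req->files = current->files;
}

/* use side: refuse to touch ->files if a flush happened in between */
static int check_files_seq(struct ring *ring, struct req *req)
{
	if (req->files_seq != atomic_read(&ring->files_seq))
		return -EBADF;
	return 0;
}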

-- 
Jens Axboe


[-- Attachment #2: 0002-io_uring-implement-flush-sequence-to-handle-files-va.patch --]
[-- Type: text/x-patch, Size: 11413 bytes --]

From 766f5ffcfc9e511c6b26abb8a80e935359a5185f Mon Sep 17 00:00:00 2001
From: Jens Axboe <[email protected]>
Date: Thu, 10 Sep 2020 16:06:21 -0600
Subject: [PATCH 2/2] io_uring: implement ->flush() sequence to handle ->files
 validity

The current scheme stashes away ->ring_fd and ->ring_file, and uses
them to check whether ->files could have changed. This works, but it
doesn't work so well for SQPOLL: if the application closes the ring_fd,
we then require that it enter the kernel again to refresh our state.

Add an atomic sequence counting ->flush() calls on the ring fd. If we
get a mismatch between this sequence as sampled before and after
grabbing ->files, we fail the request.

This should offer the same protection that we currently have, with the
added benefit of being able to update the ->files automatically.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/io_uring.c | 131 +++++++++++++++++++++++++++++---------------------
 1 file changed, 77 insertions(+), 54 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5d0247875237..bf994f195aaf 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -308,8 +308,11 @@ struct io_ring_ctx {
 	 */
 	struct fixed_file_data	*file_data;
 	unsigned		nr_user_files;
-	int 			ring_fd;
-	struct file 		*ring_file;
+
+	/* incremented when ->flush() is called */
+	atomic_t		files_seq;
+	/* assigned when ->files are grabbed */
+	int			cur_files_seq;
 
 	/* if used, fixed mapped user buffers */
 	unsigned		nr_user_bufs;
@@ -394,6 +397,7 @@ struct io_close {
 	struct file			*file;
 	struct file			*put_file;
 	int				fd;
+	int				files_seq;
 };
 
 struct io_timeout_data {
@@ -409,6 +413,7 @@ struct io_accept {
 	int __user			*addr_len;
 	int				flags;
 	unsigned long			nofile;
+	int				files_seq;
 };
 
 struct io_sync {
@@ -461,6 +466,7 @@ struct io_sr_msg {
 struct io_open {
 	struct file			*file;
 	int				dfd;
+	int				files_seq;
 	struct filename			*filename;
 	struct open_how			how;
 	unsigned long			nofile;
@@ -471,6 +477,7 @@ struct io_files_update {
 	u64				arg;
 	u32				nr_args;
 	u32				offset;
+	int				files_seq;
 };
 
 struct io_fadvise {
@@ -492,6 +499,7 @@ struct io_epoll {
 	int				epfd;
 	int				op;
 	int				fd;
+	int				files_seq;
 	struct epoll_event		event;
 };
 
@@ -518,6 +526,7 @@ struct io_statx {
 	int				dfd;
 	unsigned int			mask;
 	unsigned int			flags;
+	int				files_seq;
 	const char __user		*filename;
 	struct statx __user		*buffer;
 };
@@ -3860,6 +3869,28 @@ static int io_provide_buffers(struct io_kiocb *req, bool force_nonblock,
 	return 0;
 }
 
+/*
+ * Check that our ->files sequence matches. If files isn't assigned yet,
+ * just store the current sequence. If they are assigned, check against
+ * the sequence from when they got assigned. If we get a mismatch, we fail
+ * the request. This is only applicable to requests that set ->file_table
+ * in io_op_defs[], indicating that they need access to the files_struct
+ * when executed async.
+ */
+static int io_check_files_seq(struct io_kiocb *req, int *seq)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	if (!req->work.files) {
+		*seq = atomic_read(&ctx->files_seq);
+		return 0;
+	} else if (*seq == ctx->cur_files_seq) {
+		return 0;
+	}
+
+	return -EBADF;
+}
+
 static int io_epoll_ctl_prep(struct io_kiocb *req,
 			     const struct io_uring_sqe *sqe)
 {
@@ -3881,6 +3912,7 @@ static int io_epoll_ctl_prep(struct io_kiocb *req,
 			return -EFAULT;
 	}
 
+	req->epoll.files_seq = req->ctx->cur_files_seq;
 	return 0;
 #else
 	return -EOPNOTSUPP;
@@ -3894,10 +3926,15 @@ static int io_epoll_ctl(struct io_kiocb *req, bool force_nonblock,
 	struct io_epoll *ie = &req->epoll;
 	int ret;
 
+	ret = io_check_files_seq(req, &ie->files_seq);
+	if (ret)
+		goto done;
+
 	ret = do_epoll_ctl(ie->epfd, ie->op, ie->fd, &ie->event, force_nonblock);
 	if (force_nonblock && ret == -EAGAIN)
 		return -EAGAIN;
 
+done:
 	if (ret < 0)
 		req_set_fail_links(req);
 	__io_req_complete(req, ret, 0, cs);
@@ -3993,6 +4030,7 @@ static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	req->statx.filename = u64_to_user_ptr(READ_ONCE(sqe->addr));
 	req->statx.buffer = u64_to_user_ptr(READ_ONCE(sqe->addr2));
 	req->statx.flags = READ_ONCE(sqe->statx_flags);
+	req->statx.files_seq = req->ctx->cur_files_seq;
 
 	return 0;
 }
@@ -4002,6 +4040,10 @@ static int io_statx(struct io_kiocb *req, bool force_nonblock)
 	struct io_statx *ctx = &req->statx;
 	int ret;
 
+	ret = io_check_files_seq(req, &ctx->files_seq);
+	if (ret)
+		goto done;
+
 	if (force_nonblock) {
 		/* only need file table for an actual valid fd */
 		if (ctx->dfd == -1 || ctx->dfd == AT_FDCWD)
@@ -4012,6 +4054,7 @@ static int io_statx(struct io_kiocb *req, bool force_nonblock)
 	ret = do_statx(ctx->dfd, ctx->filename, ctx->flags, ctx->mask,
 		       ctx->buffer);
 
+done:
 	if (ret < 0)
 		req_set_fail_links(req);
 	io_req_complete(req, ret);
@@ -4037,11 +4080,11 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return -EBADF;
 
 	req->close.fd = READ_ONCE(sqe->fd);
-	if ((req->file && req->file->f_op == &io_uring_fops) ||
-	    req->close.fd == req->ctx->ring_fd)
+	if (req->file && req->file->f_op == &io_uring_fops)
 		return -EBADF;
 
 	req->close.put_file = NULL;
+	req->close.files_seq = req->ctx->cur_files_seq;
 	return 0;
 }
 
@@ -4051,6 +4094,10 @@ static int io_close(struct io_kiocb *req, bool force_nonblock,
 	struct io_close *close = &req->close;
 	int ret;
 
+	ret = io_check_files_seq(req, &close->files_seq);
+	if (ret)
+		goto done;
+
 	/* might be already done during nonblock submission */
 	if (!close->put_file) {
 		ret = __close_fd_get_file(close->fd, &close->put_file);
@@ -4069,10 +4116,11 @@ static int io_close(struct io_kiocb *req, bool force_nonblock,
 
 	/* No ->flush() or already async, safely close from here */
 	ret = filp_close(close->put_file, req->work.files);
-	if (ret < 0)
-		req_set_fail_links(req);
 	fput(close->put_file);
 	close->put_file = NULL;
+done:
+	if (ret < 0)
+		req_set_fail_links(req);
 	__io_req_complete(req, ret, 0, cs);
 	return 0;
 }
@@ -4526,6 +4574,7 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	accept->addr_len = u64_to_user_ptr(READ_ONCE(sqe->addr2));
 	accept->flags = READ_ONCE(sqe->accept_flags);
 	accept->nofile = rlimit(RLIMIT_NOFILE);
+	accept->files_seq = req->ctx->cur_files_seq;
 	return 0;
 }
 
@@ -4536,6 +4585,10 @@ static int io_accept(struct io_kiocb *req, bool force_nonblock,
 	unsigned int file_flags = force_nonblock ? O_NONBLOCK : 0;
 	int ret;
 
+	ret = io_check_files_seq(req, &accept->files_seq);
+	if (ret)
+		goto done;
+
 	if (req->file->f_flags & O_NONBLOCK)
 		req->flags |= REQ_F_NOWAIT;
 
@@ -4544,6 +4597,7 @@ static int io_accept(struct io_kiocb *req, bool force_nonblock,
 					accept->nofile);
 	if (ret == -EAGAIN && force_nonblock)
 		return -EAGAIN;
+done:
 	if (ret < 0) {
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
@@ -5513,6 +5567,7 @@ static int io_files_update_prep(struct io_kiocb *req,
 	if (!req->files_update.nr_args)
 		return -EINVAL;
 	req->files_update.arg = READ_ONCE(sqe->addr);
+	req->files_update.files_seq = req->ctx->cur_files_seq;
 	return 0;
 }
 
@@ -5523,6 +5578,10 @@ static int io_files_update(struct io_kiocb *req, bool force_nonblock,
 	struct io_uring_files_update up;
 	int ret;
 
+	ret = io_check_files_seq(req, &req->files_update.files_seq);
+	if (ret)
+		goto done;
+
 	if (force_nonblock)
 		return -EAGAIN;
 
@@ -5532,7 +5591,7 @@ static int io_files_update(struct io_kiocb *req, bool force_nonblock,
 	mutex_lock(&ctx->uring_lock);
 	ret = __io_sqe_files_update(ctx, &up, req->files_update.nr_args);
 	mutex_unlock(&ctx->uring_lock);
-
+done:
 	if (ret < 0)
 		req_set_fail_links(req);
 	__io_req_complete(req, ret, 0, cs);
@@ -6118,34 +6177,21 @@ static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req,
 
 static int io_grab_files(struct io_kiocb *req)
 {
-	int ret = -EBADF;
 	struct io_ring_ctx *ctx = req->ctx;
 
 	io_req_init_async(req);
 
 	if (req->work.files || (req->flags & REQ_F_NO_FILE_TABLE))
 		return 0;
-	if (!ctx->ring_file)
-		return -EBADF;
 
-	rcu_read_lock();
 	spin_lock_irq(&ctx->inflight_lock);
-	/*
-	 * We use the f_ops->flush() handler to ensure that we can flush
-	 * out work accessing these files if the fd is closed. Check if
-	 * the fd has changed since we started down this path, and disallow
-	 * this operation if it has.
-	 */
-	if (fcheck(ctx->ring_fd) == ctx->ring_file) {
-		list_add(&req->inflight_entry, &ctx->inflight_list);
-		req->flags |= REQ_F_INFLIGHT;
-		req->work.files = current->files;
-		ret = 0;
-	}
+	list_add(&req->inflight_entry, &ctx->inflight_list);
+	req->flags |= REQ_F_INFLIGHT;
+	ctx->cur_files_seq = atomic_read(&ctx->files_seq);
+	req->work.files = current->files;
 	spin_unlock_irq(&ctx->inflight_lock);
-	rcu_read_unlock();
 
-	return ret;
+	return 0;
 }
 
 static inline int io_prep_work_files(struct io_kiocb *req)
@@ -6705,14 +6751,7 @@ static enum sq_ret __io_sq_thread(struct io_ring_ctx *ctx,
 		mutex_unlock(&ctx->uring_lock);
 	}
 
-	/*
-	 * If ->ring_file is NULL, we're waiting on new fd/file assigment.
-	 * Don't submit anything new until that happens.
-	 */
-	if (ctx->ring_file)
-		to_submit = io_sqring_entries(ctx);
-	else
-		to_submit = 0;
+	to_submit = io_sqring_entries(ctx);
 
 	/*
 	 * If submit got -EBUSY, flag us as needing the application
@@ -6756,7 +6795,7 @@ static enum sq_ret __io_sq_thread(struct io_ring_ctx *ctx,
 		}
 
 		to_submit = io_sqring_entries(ctx);
-		if (!to_submit || ret == -EBUSY || !ctx->ring_file)
+		if (!to_submit || ret == -EBUSY)
 			return SQT_IDLE;
 
 		finish_wait(&sqd->wait, &ctx->sqo_wait_entry);
@@ -8547,6 +8586,9 @@ static int io_uring_flush(struct file *file, void *data)
 {
 	struct io_ring_ctx *ctx = file->private_data;
 
+	/* assume current files sequence is no longer valid */
+	atomic_inc(&ctx->files_seq);
+
 	io_uring_cancel_files(ctx, data);
 
 	/*
@@ -8558,13 +8600,8 @@ static int io_uring_flush(struct file *file, void *data)
 	} else if (ctx->flags & IORING_SETUP_SQPOLL) {
 		struct io_sq_data *sqd = ctx->sq_data;
 
-		/* Ring is being closed, mark us as neding new assignment */
+		/* quiesce sqpoll thread */
 		io_sq_thread_park(sqd);
-		mutex_lock(&ctx->uring_lock);
-		ctx->ring_fd = -1;
-		ctx->ring_file = NULL;
-		mutex_unlock(&ctx->uring_lock);
-		io_ring_set_wakeup_flag(ctx);
 		io_sq_thread_unpark(sqd);
 	}
 
@@ -8701,18 +8738,6 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
 		if (!list_empty_careful(&ctx->cq_overflow_list))
 			io_cqring_overflow_flush(ctx, false);
-		if (fd != ctx->ring_fd) {
-			struct io_sq_data *sqd = ctx->sq_data;
-
-			io_sq_thread_park(sqd);
-
-			mutex_lock(&ctx->uring_lock);
-			ctx->ring_fd = fd;
-			ctx->ring_file = f.file;
-			mutex_unlock(&ctx->uring_lock);
-
-			io_sq_thread_unpark(sqd);
-		}
 		if (flags & IORING_ENTER_SQ_WAKEUP)
 			wake_up(&ctx->sq_data->wait);
 		if (flags & IORING_ENTER_SQ_WAIT)
@@ -8720,8 +8745,6 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		submitted = to_submit;
 	} else if (to_submit) {
 		mutex_lock(&ctx->uring_lock);
-		ctx->ring_fd = fd;
-		ctx->ring_file = f.file;
 		submitted = io_submit_sqes(ctx, to_submit);
 		mutex_unlock(&ctx->uring_lock);
 
-- 
2.28.0


[-- Attachment #3: 0001-io_uring-stash-ctx-task-reference-instead-of-task-fi.patch --]
[-- Type: text/x-patch, Size: 3530 bytes --]

From e9194eaa20005913b3c39a5c5124c3f803e4074a Mon Sep 17 00:00:00 2001
From: Jens Axboe <[email protected]>
Date: Thu, 10 Sep 2020 16:01:15 -0600
Subject: [PATCH 1/2] io_uring: stash ctx task reference instead of task files

We can grab a reference to the task instead of stashing away the task's
files_struct. This is doable without creating a circular reference
between the ring fd and the task itself.

This is in preparation for handling the ->files assignment a bit
differently, so we don't need to force SQPOLL to enter the kernel for
an update.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/io_uring.c | 33 ++++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 98cddcc03a16..5d0247875237 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -290,11 +290,10 @@ struct io_ring_ctx {
 	struct io_wq		*io_wq;
 	struct mm_struct	*sqo_mm;
 	/*
-	 * For SQPOLL usage - no reference is held to this file table, we
-	 * rely on fops->flush() and our callback there waiting for the users
-	 * to finish.
+	 * For SQPOLL usage - we hold a reference to the parent task, so we
+	 * have access to the ->files
 	 */
-	struct files_struct	*sqo_files;
+	struct task_struct	*sqo_task;
 
 	struct wait_queue_entry	sqo_wait_entry;
 	struct list_head	sqd_list;
@@ -6824,10 +6823,12 @@ static int io_sq_thread(void *data)
 				old_cred = override_creds(ctx->creds);
 			}
 
-			if (current->files != ctx->sqo_files) {
+			if (current->files != ctx->sqo_task->files) {
+				task_lock(ctx->sqo_task);
 				task_lock(current);
-				current->files = ctx->sqo_files;
+				current->files = ctx->sqo_task->files;
 				task_unlock(current);
+				task_unlock(ctx->sqo_task);
 			}
 
 			ret |= __io_sq_thread(ctx, start_jiffies, cap_entries);
@@ -7148,6 +7149,11 @@ static void io_finish_async(struct io_ring_ctx *ctx)
 		io_wq_destroy(ctx->io_wq);
 		ctx->io_wq = NULL;
 	}
+
+	if (ctx->sqo_task) {
+		put_task_struct(ctx->sqo_task);
+		ctx->sqo_task = NULL;
+	}
 }
 
 #if defined(CONFIG_UNIX)
@@ -7794,11 +7800,11 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 		mutex_unlock(&sqd->ctx_lock);
 
 		/*
-		 * We will exit the sqthread before current exits, so we can
-		 * avoid taking a reference here and introducing weird
-		 * circular dependencies on the files table.
+		 * Grab task reference for SQPOLL usage. This doesn't
+		 * introduce a circular reference, as the task reference is
+		 * just to ensure that the struct itself stays valid.
 		 */
-		ctx->sqo_files = current->files;
+		ctx->sqo_task = get_task_struct(current);
 
 		ctx->sq_thread_idle = msecs_to_jiffies(p->sq_thread_idle);
 		if (!ctx->sq_thread_idle)
@@ -7840,7 +7846,10 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 
 	return 0;
 err:
-	ctx->sqo_files = NULL;
+	if (ctx->sqo_task) {
+		put_task_struct(ctx->sqo_task);
+		ctx->sqo_task = NULL;
+	}
 	io_finish_async(ctx);
 	return ret;
 }
@@ -8554,7 +8563,6 @@ static int io_uring_flush(struct file *file, void *data)
 		mutex_lock(&ctx->uring_lock);
 		ctx->ring_fd = -1;
 		ctx->ring_file = NULL;
-		ctx->sqo_files = NULL;
 		mutex_unlock(&ctx->uring_lock);
 		io_ring_set_wakeup_flag(ctx);
 		io_sq_thread_unpark(sqd);
@@ -8701,7 +8709,6 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 			mutex_lock(&ctx->uring_lock);
 			ctx->ring_fd = fd;
 			ctx->ring_file = f.file;
-			ctx->sqo_files = current->files;
 			mutex_unlock(&ctx->uring_lock);
 
 			io_sq_thread_unpark(sqd);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works, with SQPOLL
  2020-09-10 22:11                       ` Jens Axboe
@ 2020-09-10 23:04                         ` Jens Axboe
  0 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2020-09-10 23:04 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Jann Horn

On 9/10/20 4:11 PM, Jens Axboe wrote:
> On 9/10/20 3:01 PM, Jens Axboe wrote:
>> On 9/10/20 12:18 PM, Jens Axboe wrote:
>>> On 9/10/20 7:11 AM, Jens Axboe wrote:
>>>> On 9/10/20 6:37 AM, Pavel Begunkov wrote:
>>>>> On 09/09/2020 19:07, Jens Axboe wrote:
>>>>>> On 9/9/20 9:48 AM, Pavel Begunkov wrote:
>>>>>>> On 09/09/2020 16:10, Jens Axboe wrote:
>>>>>>>> On 9/9/20 1:09 AM, Pavel Begunkov wrote:
>>>>>>>>> On 09/09/2020 01:54, Jens Axboe wrote:
>>>>>>>>>> On 9/8/20 3:22 PM, Jens Axboe wrote:
>>>>>>>>>>> On 9/8/20 2:58 PM, Pavel Begunkov wrote:
>>>>>>>>>>>> On 08/09/2020 20:48, Jens Axboe wrote:
>>>>>>>>>>>>> Fd instantiating commands like IORING_OP_ACCEPT now work with SQPOLL, but
>>>>>>>>>>>>> we have an error in grabbing that if IOSQE_ASYNC is set. Ensure we assign
>>>>>>>>>>>>> the ring fd/file appropriately so we can defer grab them.
>>>>>>>>>>>>
>>>>>>>>>>>> IIRC, for fcheck() in io_grab_files() to work it should be under fdget(),
>>>>>>>>>>>> that isn't the case with SQPOLL threads. Am I mistaken?
>>>>>>>>>>>>
>>>>>>>>>>>> And it looks strange that the following snippet will effectively disable
>>>>>>>>>>>> such requests.
>>>>>>>>>>>>
>>>>>>>>>>>> fd = dup(ring_fd)
>>>>>>>>>>>> close(ring_fd)
>>>>>>>>>>>> ring_fd = fd
>>>>>>>>>>>
>>>>>>>>>>> Not disagreeing with that, I think my initial posting made it clear
>>>>>>>>>>> it was a hack. Just piled it in there for easier testing in terms
>>>>>>>>>>> of functionality.
>>>>>>>>>>>
>>>>>>>>>>> But the next question is how to do this right...
>>>>>>>>>> Looking at this a bit more, and I don't necessarily think there's a
>>>>>>>>>> better option. If you dup+close, then it just won't work. We have no
>>>>>>>>>> way of knowing if the 'fd' changed, but we can detect if it was closed
>>>>>>>>>> and then we'll end up just EBADF'ing the requests.
>>>>>>>>>>
>>>>>>>>>> So right now the answer is that we can support this just fine with
>>>>>>>>>> SQPOLL, but you better not dup and close the original fd. Which is not
>>>>>>>>>> ideal, but better than NOT being able to support it.
>>>>>>>>>>
>>>>>>>>>> Only other option I see is to provide an io_uring_register()
>>>>>>>>>> command to update the fd/file associated with it. Which may be useful,
>>>>>>>>>> as it allows a process to indeed do this, if it absolutely has to.
>>>>>>>>>
>>>>>>>>> Let's put aside such dirty hacks, at least until someone actually
>>>>>>>>> needs it. Ideally, for many reasons I'd prefer to get rid of
>>>>>>>>
>>>>>>>> But it is actually needed, otherwise we're even more in a limbo state of
>>>>>>>> "SQPOLL works for most things now, just not all". And this isn't that
>>>>>>>> hard to make right - on the flush() side, we just need to park/stall the
>>>>>>>
>>>>>>> I understand that it isn't hard, but I just don't want to expose it to
>>>>>>> the userspace, a) because it's a userspace API, so couldn't probably be
>>>>>>> killed in the future, b) works around kernel's problems, and so
>>>>>>> shouldn't really be exposed to the userspace in normal circumstances.
>>>>>>>
>>>>>>> And it's not generic enough because of a possible "many fds -> single
>>>>>>> file" mapping, and there will be a lot of questions and problems.
>>>>>>>
>>>>>>> e.g. if a process shares a io_uring with another process, then
>>>>>>> dup()+close() would require not only this hook but also additional
>>>>>>> inter-process synchronisation. And so on.
>>>>>>
>>>>>> I think you're blowing this out of proportion. Just to restate the
>>>>>
>>>>> I just think that if there is a potentially cleaner solution without
>>>>> involving userspace, we should try to look for it first, even if it
>>>>> would take more time. That was the point.
>>>>
>>>> Regardless of whether or not we can eliminate that need, at least it'll
>>>> be a relaxing of the restriction, not an increase of it. It'll never
>>>> hurt to do an extra system call for the case where you're swapping fds.
>>>> I do get your point, I just don't think it's a big deal.
>>>
>>> BTW, I don't see how we can ever get rid of a need to enter the kernel,
>>> we'd need some chance at grabbing the updated ->files, for instance.
>>> Might be possible to hold a reference to the task and grab it from
>>> there, though feels a bit iffy to hold a task reference from the ring on
>>> the task that holds a reference to the ring. Haven't looked too close,
>>> should work though as this won't hold a file/files reference, it's just
>>> a freeing reference.
>>
>> Sort of half assed attempt...
>>
>> Idea is to assign a ->files sequence before we grab files, and then
>> compare with the current one once we need to use the files. If they
>> mismatch, we -ECANCELED the request.
>>
>> For SQPOLL, don't grab ->files upfront, grab a reference to the task
>> instead. Use the task reference to assign files when we need it.
>>
>> Adding Jann to help poke holes in this scheme. I'd be surprised if it's
>> solid as-is, but hopefully we can build on this idea and get rid of the
>> fcheck().
> 
> Split it into two, to make it easier to reason about. Added a few
> comments, etc.

Pushed to a temp branch, made a few more edits (forgot to wire up open).

https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-files_struct
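
The forgotten open wiring presumably mirrors the other opcodes in the
patch above -- roughly this, though the branch has the authoritative
version:

	/* in io_openat_prep() / io_openat2_prep() */
	req->open.files_seq = req->ctx->cur_files_seq;

	/* and at the top of io_openat2(), like the other handlers */
	ret = io_check_files_seq(req, &req->open.files_seq);
	if (ret)
		goto done;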

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works, with SQPOLL
  2020-09-10 18:18                   ` Jens Axboe
  2020-09-10 21:01                     ` Jens Axboe
@ 2020-09-11 19:23                     ` Pavel Begunkov
  2020-09-11 20:06                       ` Jens Axboe
  1 sibling, 1 reply; 18+ messages in thread
From: Pavel Begunkov @ 2020-09-11 19:23 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 10/09/2020 21:18, Jens Axboe wrote:
> On 9/10/20 7:11 AM, Jens Axboe wrote:
>> On 9/10/20 6:37 AM, Pavel Begunkov wrote:
>>> On 09/09/2020 19:07, Jens Axboe wrote:
>>>> On 9/9/20 9:48 AM, Pavel Begunkov wrote:
>>>>> On 09/09/2020 16:10, Jens Axboe wrote:
>>>>>> On 9/9/20 1:09 AM, Pavel Begunkov wrote:
>>>>>>> On 09/09/2020 01:54, Jens Axboe wrote:
>>>>>>>> On 9/8/20 3:22 PM, Jens Axboe wrote:
>>>>>>>>> On 9/8/20 2:58 PM, Pavel Begunkov wrote:
>>>>>>>>>> On 08/09/2020 20:48, Jens Axboe wrote:
>>>>>>>>>>> Fd instantiating commands like IORING_OP_ACCEPT now work with SQPOLL, but
>>>>>>>>>>> we have an error in grabbing that if IOSQE_ASYNC is set. Ensure we assign
>>>>>>>>>>> the ring fd/file appropriately so we can defer grab them.
>>>>>>>>>>
>>>>>>>>>> IIRC, for fcheck() in io_grab_files() to work it should be under fdget(),
>>>>>>>>>> that isn't the case with SQPOLL threads. Am I mistaken?
>>>>>>>>>>
>>>>>>>>>> And it looks strange that the following snippet will effectively disable
>>>>>>>>>> such requests.
>>>>>>>>>>
>>>>>>>>>> fd = dup(ring_fd)
>>>>>>>>>> close(ring_fd)
>>>>>>>>>> ring_fd = fd
>>>>>>>>>
>>>>>>>>> Not disagreeing with that, I think my initial posting made it clear
>>>>>>>>> it was a hack. Just piled it in there for easier testing in terms
>>>>>>>>> of functionality.
>>>>>>>>>
>>>>>>>>> But the next question is how to do this right...
>>>>>>>> Looking at this a bit more, and I don't necessarily think there's a
>>>>>>>> better option. If you dup+close, then it just won't work. We have no
>>>>>>>> way of knowing if the 'fd' changed, but we can detect if it was closed
>>>>>>>> and then we'll end up just EBADF'ing the requests.
>>>>>>>>
>>>>>>>> So right now the answer is that we can support this just fine with
>>>>>>>> SQPOLL, but you better not dup and close the original fd. Which is not
>>>>>>>> ideal, but better than NOT being able to support it.
>>>>>>>>
>>>>>>>> Only other option I see is to provide an io_uring_register()
>>>>>>>> command to update the fd/file associated with it. Which may be useful,
>>>>>>>> as it allows a process to indeed do this, if it absolutely has to.
>>>>>>>
>>>>>>> Let's put aside such dirty hacks, at least until someone actually
>>>>>>> needs it. Ideally, for many reasons I'd prefer to get rid of
>>>>>>
>>>>>> But it is actually needed, otherwise we're even more in a limbo state of
>>>>>> "SQPOLL works for most things now, just not all". And this isn't that
>>>>>> hard to make right - on the flush() side, we just need to park/stall the
>>>>>
>>>>> I understand that it isn't hard, but I just don't want to expose it to
>>>>> the userspace, a) because it's a userspace API, so couldn't probably be
>>>>> killed in the future, b) works around kernel's problems, and so
>>>>> shouldn't really be exposed to the userspace in normal circumstances.
>>>>>
>>>>> And it's not generic enough because of a possible "many fds -> single
>>>>> file" mapping, and there will be a lot of questions and problems.
>>>>>
>>>>> e.g. if a process shares a io_uring with another process, then
>>>>> dup()+close() would require not only this hook but also additional
>>>>> inter-process synchronisation. And so on.
>>>>
>>>> I think you're blowing this out of proportion. Just to restate the
>>>
>>> I just think that if there is a potentially cleaner solution without
>>> involving userspace, we should try to look for it first, even if it
>>> would take more time. That was the point.
>>
>> Regardless of whether or not we can eliminate that need, at least it'll
>> be a relaxing of the restriction, not an increase of it. It'll never
>> hurt to do an extra system call for the case where you're swapping fds.
>> I do get your point, I just don't think it's a big deal.
> 
> BTW, I don't see how we can ever get rid of a need to enter the kernel,
> we'd need some chance at grabbing the updated ->files, for instance.

Thanks for taking a look.
Yeah, agreed, it should get it from somewhere, and that reminds me that
we have a similar situation with sqo_mm -- it grabs it from the creating
task and keeps it to the end... Do we really need to set ->files of
another thread? Retaining it to the end seems to work well enough with
mm. And if we do need that, then it would be more consistent to replace
mm there as well.

> Might be possible to hold a reference to the task and grab it from
> there, though feels a bit iffy to hold a task reference from the ring on
> the task that holds a reference to the ring. Haven't looked too close,
> should work though as this won't hold a file/files reference, it's just
> a freeing reference.

BTW, if the creating process dies, then its ->files might be torn down
and ->sqo_files would become dangling, so it should be invalidated. Your
approach with a task reference probably handles that naturally.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works, with SQPOLL
  2020-09-11 19:23                     ` Pavel Begunkov
@ 2020-09-11 20:06                       ` Jens Axboe
  0 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2020-09-11 20:06 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 9/11/20 1:23 PM, Pavel Begunkov wrote:
> On 10/09/2020 21:18, Jens Axboe wrote:
>> On 9/10/20 7:11 AM, Jens Axboe wrote:
>>> On 9/10/20 6:37 AM, Pavel Begunkov wrote:
>>>> On 09/09/2020 19:07, Jens Axboe wrote:
>>>>> On 9/9/20 9:48 AM, Pavel Begunkov wrote:
>>>>>> On 09/09/2020 16:10, Jens Axboe wrote:
>>>>>>> On 9/9/20 1:09 AM, Pavel Begunkov wrote:
>>>>>>>> On 09/09/2020 01:54, Jens Axboe wrote:
>>>>>>>>> On 9/8/20 3:22 PM, Jens Axboe wrote:
>>>>>>>>>> On 9/8/20 2:58 PM, Pavel Begunkov wrote:
>>>>>>>>>>> On 08/09/2020 20:48, Jens Axboe wrote:
>>>>>>>>>>>> Fd instantiating commands like IORING_OP_ACCEPT now work with SQPOLL, but
>>>>>>>>>>>> we have an error in grabbing that if IOSQE_ASYNC is set. Ensure we assign
>>>>>>>>>>>> the ring fd/file appropriately so we can defer grab them.
>>>>>>>>>>>
>>>>>>>>>>> IIRC, for fcheck() in io_grab_files() to work it should be under fdget(),
>>>>>>>>>>> that isn't the case with SQPOLL threads. Am I mistaken?
>>>>>>>>>>>
>>>>>>>>>>> And it looks strange that the following snippet will effectively disable
>>>>>>>>>>> such requests.
>>>>>>>>>>>
>>>>>>>>>>> fd = dup(ring_fd)
>>>>>>>>>>> close(ring_fd)
>>>>>>>>>>> ring_fd = fd
>>>>>>>>>>
>>>>>>>>>> Not disagreeing with that, I think my initial posting made it clear
>>>>>>>>>> it was a hack. Just piled it in there for easier testing in terms
>>>>>>>>>> of functionality.
>>>>>>>>>>
>>>>>>>>>> But the next question is how to do this right...
>>>>>>>>> Looking at this a bit more, and I don't necessarily think there's a
>>>>>>>>> better option. If you dup+close, then it just won't work. We have no
>>>>>>>>> way of knowing if the 'fd' changed, but we can detect if it was closed
>>>>>>>>> and then we'll end up just EBADF'ing the requests.
>>>>>>>>>
>>>>>>>>> So right now the answer is that we can support this just fine with
>>>>>>>>> SQPOLL, but you better not dup and close the original fd. Which is not
>>>>>>>>> ideal, but better than NOT being able to support it.
>>>>>>>>>
>>>>>>>>> Only other option I see is to to provide an io_uring_register()
>>>>>>>>> command to update the fd/file associated with it. Which may be useful,
>>>>>>>>> it allows a process to indeed to this, if it absolutely has to.
>>>>>>>>
>>>>>>>> Let's put aside such dirty hacks, at least until someone actually
>>>>>>>> needs it. Ideally, for many reasons I'd prefer to get rid of
>>>>>>>
>>>>>>> But it is actually needed, otherwise we're even more in a limbo state of
>>>>>>> "SQPOLL works for most things now, just not all". And this isn't that
>>>>>>> hard to make right - on the flush() side, we just need to park/stall the
>>>>>>
>>>>>> I understand that it isn't hard, but I just don't want to expose it to
>>>>>> the userspace, a) because it's a userspace API, so couldn't probably be
>>>>>> killed in the future, b) works around kernel's problems, and so
>>>>>> shouldn't really be exposed to the userspace in normal circumstances.
>>>>>>
>>>>>> And it's not generic enough because of a possible "many fds -> single
>>>>>> file" mapping, and there will be a lot of questions and problems.
>>>>>>
>>>>>> e.g. if a process shares a io_uring with another process, then
>>>>>> dup()+close() would require not only this hook but also additional
>>>>>> inter-process synchronisation. And so on.
>>>>>
>>>>> I think you're blowing this out of proportion. Just to restate the
>>>>
>>>> I just think that if there is a potentially cleaner solution without
>>>> involving userspace, we should try to look for it first, even if it
>>>> would take more time. That was the point.
>>>
>>> Regardless of whether or not we can eliminate that need, at least it'll
>>> be a relaxing of the restriction, not an increase of it. It'll never
>>> hurt to do an extra system call for the case where you're swapping fds.
>>> I do get your point, I just don't think it's a big deal.
>>
>> BTW, I don't see how we can ever get rid of a need to enter the kernel,
>> we'd need some chance at grabbing the updated ->files, for instance.
> 
> Thanks for taking a look.
> Yeah, agreed, it should get it from somewhere, and that reminds me that
> we have a similar situation with sqo_mm -- it grabs it from the creating
> task and keeps it to the end... Do we really need to set ->files of
> another thread? Retaining it to the end seems to work well enough with
> mm. And if we do need that, then it would be more consistent to replace
> mm there as well.

The files can change, so we need the juggling.
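
Purely as a hypothetical userspace illustration of that, not from the
patches: unshare(2) is one way the table gets swapped out from under
anything that cached the old pointer. Once the fd table is shared,
CLONE_FILES gives the caller a private copy:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
	(void)arg;
	pause();	/* keeps a shared reference to the fd table alive */
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);	/* fd table now shared */

	/* replaces current->files with a private copy; the old
	 * files_struct lives on only for the worker thread */
	if (unshare(CLONE_FILES) != 0)
		perror("unshare");
	else
		printf("main thread now has its own fd table\n");
	return 0;
}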

>> Might be possible to hold a reference to the task and grab it from
>> there, though feels a bit iffy to hold a task reference from the ring on
>> the task that holds a reference to the ring. Haven't looked too close,
>> should work though as this won't hold a file/files reference, it's just
>> a freeing reference.
> 
> BTW, if the creating process dies, then its ->files might be torn down
> and ->sqo_files would become dangling, so it should be invalidated. Your
> approach with a task reference probably handles that naturally.

We do prune and cancel if the process goes away, so it shouldn't have
that issue. But yes, it falls out naturally with the task-based
approach.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-09-11 20:06 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-08 17:48 [PATCH for-next] io_uring: ensure IOSQE_ASYNC file table grabbing works, with SQPOLL Jens Axboe
2020-09-08 20:58 ` Pavel Begunkov
2020-09-08 21:22   ` Jens Axboe
2020-09-08 22:54     ` Jens Axboe
2020-09-09  0:48       ` Josef
2020-09-09  7:09       ` Pavel Begunkov
2020-09-09 13:10         ` Jens Axboe
2020-09-09 13:53           ` Jens Axboe
2020-09-09 15:48           ` Pavel Begunkov
2020-09-09 16:07             ` Jens Axboe
2020-09-10 12:37               ` Pavel Begunkov
2020-09-10 13:11                 ` Jens Axboe
2020-09-10 18:18                   ` Jens Axboe
2020-09-10 21:01                     ` Jens Axboe
2020-09-10 22:11                       ` Jens Axboe
2020-09-10 23:04                         ` Jens Axboe
2020-09-11 19:23                     ` Pavel Begunkov
2020-09-11 20:06                       ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox