Re: [PATCH 5.4 033/222] io_uring: only allow submit from owning task

public inbox for [email protected]
 help / color / mirror / Atom feed

* Re: [PATCH 5.4 033/222] io_uring: only allow submit from owning task
       [not found] ` <[email protected]>
@ 2020-01-24 10:38   ` Stefan Metzmacher
  2020-01-24 10:41     ` Stefan Metzmacher
                       ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Stefan Metzmacher @ 2020-01-24 10:38 UTC (permalink / raw)
  To: Greg Kroah-Hartman, linux-kernel; +Cc: stable, Jens Axboe, io-uring

[-- Attachment #1.1: Type: text/plain, Size: 3253 bytes --]

Am 22.01.20 um 10:26 schrieb Greg Kroah-Hartman:
> From: Jens Axboe <[email protected]>
> 
> commit 44d282796f81eb1debc1d7cb53245b4cb3214cb5 upstream.
> 
> If the credentials or the mm doesn't match, don't allow the task to
> submit anything on behalf of this ring. The task that owns the ring can
> pass the file descriptor to another task, but we don't want to allow
> that task to submit an SQE that then assumes the ring mm and creds if
> it needs to go async.
> 
> Cc: [email protected]
> Suggested-by: Stefan Metzmacher <[email protected]>
> Signed-off-by: Jens Axboe <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
> 
> 
> ---
>  fs/io_uring.c |    6 ++++++
>  1 file changed, 6 insertions(+)
> 
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -3716,6 +3716,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned
>  			wake_up(&ctx->sqo_wait);
>  		submitted = to_submit;
>  	} else if (to_submit) {
> +		if (current->mm != ctx->sqo_mm ||
> +		    current_cred() != ctx->creds) {
> +			ret = -EPERM;
> +			goto out;
> +		}
> +

I thought about this a bit more.

I'm not sure if this is actually to restrictive,
because it means applications like Samba won't
be able to use io-uring at all.

As even if current_cred() and ctx->creds describe the same
set of uid,gids the != won't ever match again and
makes the whole ring unuseable.

I'm not sure about what the best short term solution could be...

1. May just doing the check for path based operations?
  and fail individual requests with EPERM.

2. Or force REQ_F_FORCE_ASYNC for path based operations,
  so that they're always executed from within the workqueue
  with were ctx->creds is active.

3. Or (as proposed earlier) do the override_creds/revert_creds dance
  (and similar for mm) if needed.

To summaries the problem again:

For path based operations like:
- IORING_OP_CONNECT (maybe also - IORING_OP_ACCEPT???)
- IORING_OP_SEND*, IORING_OP_RECV* on DGRAM sockets
- IORING_OP_OPENAT, IORING_OP_STATX, IORING_OP_OPENAT2
it's important under which current_cred they are called.

Are IORING_OP_MADVISE, IORING_OP_FADVISE and IORING_OP_FALLOCATE
are only bound to the credentials of the passed fd they operate on?

The current assumption is that the io_uring_setup() syscall captures
the current_cred() to ctx->cred and all operations on the ring
are executed under the context of ctx->cred.
Therefore all helper threads do the override_creds/revert_creds dance.

But the possible non-blocking line execution of operations in
the io_uring_enter() syscall doesn't do the override_creds/revert_creds
dance and execute the operations under current_cred().

This means it's random depending on filled cached under what
credentials an operation is executed.

In order to prevent security problems the current patch is enough,
but as outlined above it will make io-uring complete unuseable
for applications using any syscall that changes current_cred().

Change 1. would be a little bit better, but still not really useful.

I'd actually prefer solution 3. as it's still possible to make
use of non-blocking operations, while the security is the
same as solution 2.

metze

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 5.4 033/222] io_uring: only allow submit from owning task
  2020-01-24 10:38   ` [PATCH 5.4 033/222] io_uring: only allow submit from owning task Stefan Metzmacher
@ 2020-01-24 10:41     ` Stefan Metzmacher
  2020-01-24 16:58     ` Jens Axboe
  2020-01-26  5:54     ` Andres Freund
  2 siblings, 0 replies; 7+ messages in thread
From: Stefan Metzmacher @ 2020-01-24 10:41 UTC (permalink / raw)
  To: Greg Kroah-Hartman, linux-kernel; +Cc: stable, Jens Axboe, io-uring


[-- Attachment #1.1: Type: text/plain, Size: 1520 bytes --]

Am 24.01.20 um 11:38 schrieb Stefan Metzmacher:
> Am 22.01.20 um 10:26 schrieb Greg Kroah-Hartman:
>> From: Jens Axboe <[email protected]>
>>
>> commit 44d282796f81eb1debc1d7cb53245b4cb3214cb5 upstream.
>>
>> If the credentials or the mm doesn't match, don't allow the task to
>> submit anything on behalf of this ring. The task that owns the ring can
>> pass the file descriptor to another task, but we don't want to allow
>> that task to submit an SQE that then assumes the ring mm and creds if
>> it needs to go async.
>>
>> Cc: [email protected]
>> Suggested-by: Stefan Metzmacher <[email protected]>
>> Signed-off-by: Jens Axboe <[email protected]>
>> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>>
>>
>> ---
>>  fs/io_uring.c |    6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -3716,6 +3716,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned
>>  			wake_up(&ctx->sqo_wait);
>>  		submitted = to_submit;
>>  	} else if (to_submit) {
>> +		if (current->mm != ctx->sqo_mm ||
>> +		    current_cred() != ctx->creds) {
>> +			ret = -EPERM;
>> +			goto out;
>> +		}
>> +
> 
> I thought about this a bit more.
> 
> I'm not sure if this is actually to restrictive,
> because it means applications like Samba won't
> be able to use io-uring at all.

Even for simple operations like IORING_OP_READ*, IORING_OP_WRITE*,
IORING_OP_FSYNC and IORING_OP_SYNC_FILE_RANGE, which only operate
on the given fd.

metze


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 5.4 033/222] io_uring: only allow submit from owning task
  2020-01-24 10:38   ` [PATCH 5.4 033/222] io_uring: only allow submit from owning task Stefan Metzmacher
  2020-01-24 10:41     ` Stefan Metzmacher
@ 2020-01-24 16:58     ` Jens Axboe
  2020-01-24 19:11       ` Stefan Metzmacher
  2020-01-26  5:54     ` Andres Freund
  2 siblings, 1 reply; 7+ messages in thread
From: Jens Axboe @ 2020-01-24 16:58 UTC (permalink / raw)
  To: Stefan Metzmacher, Greg Kroah-Hartman, linux-kernel; +Cc: stable, io-uring

On 1/24/20 3:38 AM, Stefan Metzmacher wrote:
> Am 22.01.20 um 10:26 schrieb Greg Kroah-Hartman:
>> From: Jens Axboe <[email protected]>
>>
>> commit 44d282796f81eb1debc1d7cb53245b4cb3214cb5 upstream.
>>
>> If the credentials or the mm doesn't match, don't allow the task to
>> submit anything on behalf of this ring. The task that owns the ring can
>> pass the file descriptor to another task, but we don't want to allow
>> that task to submit an SQE that then assumes the ring mm and creds if
>> it needs to go async.
>>
>> Cc: [email protected]
>> Suggested-by: Stefan Metzmacher <[email protected]>
>> Signed-off-by: Jens Axboe <[email protected]>
>> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>>
>>
>> ---
>>  fs/io_uring.c |    6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -3716,6 +3716,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned
>>  			wake_up(&ctx->sqo_wait);
>>  		submitted = to_submit;
>>  	} else if (to_submit) {
>> +		if (current->mm != ctx->sqo_mm ||
>> +		    current_cred() != ctx->creds) {
>> +			ret = -EPERM;
>> +			goto out;
>> +		}
>> +
> 
> I thought about this a bit more.
> 
> I'm not sure if this is actually to restrictive,
> because it means applications like Samba won't
> be able to use io-uring at all.
> 
> As even if current_cred() and ctx->creds describe the same
> set of uid,gids the != won't ever match again and
> makes the whole ring unuseable.
> 
> I'm not sure about what the best short term solution could be...
> 
> 1. May just doing the check for path based operations?
>   and fail individual requests with EPERM.
> 
> 2. Or force REQ_F_FORCE_ASYNC for path based operations,
>   so that they're always executed from within the workqueue
>   with were ctx->creds is active.
> 
> 3. Or (as proposed earlier) do the override_creds/revert_creds dance
>   (and similar for mm) if needed.
> 
> To summaries the problem again:
> 
> For path based operations like:
> - IORING_OP_CONNECT (maybe also - IORING_OP_ACCEPT???)
> - IORING_OP_SEND*, IORING_OP_RECV* on DGRAM sockets
> - IORING_OP_OPENAT, IORING_OP_STATX, IORING_OP_OPENAT2
> it's important under which current_cred they are called.
> 
> Are IORING_OP_MADVISE, IORING_OP_FADVISE and IORING_OP_FALLOCATE
> are only bound to the credentials of the passed fd they operate on?
> 
> The current assumption is that the io_uring_setup() syscall captures
> the current_cred() to ctx->cred and all operations on the ring
> are executed under the context of ctx->cred.
> Therefore all helper threads do the override_creds/revert_creds dance.

But it doesn't - we're expecting them to match, and with this change,
we assert that it's the case or return -EPERM.

> But the possible non-blocking line execution of operations in
> the io_uring_enter() syscall doesn't do the override_creds/revert_creds
> dance and execute the operations under current_cred().
> 
> This means it's random depending on filled cached under what
> credentials an operation is executed.
> 
> In order to prevent security problems the current patch is enough,
> but as outlined above it will make io-uring complete unuseable
> for applications using any syscall that changes current_cred().
> 
> Change 1. would be a little bit better, but still not really useful.
> 
> I'd actually prefer solution 3. as it's still possible to make
> use of non-blocking operations, while the security is the
> same as solution 2.

For your situation, we need to extend it anyway, and provide a way
to swap between personalities. So yeah it won't work as-is for your
use case, but we can work on making that the case.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 5.4 033/222] io_uring: only allow submit from owning task
  2020-01-24 16:58     ` Jens Axboe
@ 2020-01-24 19:11       ` Stefan Metzmacher
  2020-01-24 21:41         ` Jens Axboe
  0 siblings, 1 reply; 7+ messages in thread
From: Stefan Metzmacher @ 2020-01-24 19:11 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel; +Cc: stable, io-uring


[-- Attachment #1.1: Type: text/plain, Size: 4464 bytes --]

Am 24.01.20 um 17:58 schrieb Jens Axboe:
> On 1/24/20 3:38 AM, Stefan Metzmacher wrote:
>> Am 22.01.20 um 10:26 schrieb Greg Kroah-Hartman:
>>> From: Jens Axboe <[email protected]>
>>>
>>> commit 44d282796f81eb1debc1d7cb53245b4cb3214cb5 upstream.
>>>
>>> If the credentials or the mm doesn't match, don't allow the task to
>>> submit anything on behalf of this ring. The task that owns the ring can
>>> pass the file descriptor to another task, but we don't want to allow
>>> that task to submit an SQE that then assumes the ring mm and creds if
>>> it needs to go async.
>>>
>>> Cc: [email protected]
>>> Suggested-by: Stefan Metzmacher <[email protected]>
>>> Signed-off-by: Jens Axboe <[email protected]>
>>> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>>>
>>>
>>> ---
>>>  fs/io_uring.c |    6 ++++++
>>>  1 file changed, 6 insertions(+)
>>>
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -3716,6 +3716,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned
>>>  			wake_up(&ctx->sqo_wait);
>>>  		submitted = to_submit;
>>>  	} else if (to_submit) {
>>> +		if (current->mm != ctx->sqo_mm ||
>>> +		    current_cred() != ctx->creds) {
>>> +			ret = -EPERM;
>>> +			goto out;
>>> +		}
>>> +
>>
>> I thought about this a bit more.
>>
>> I'm not sure if this is actually to restrictive,
>> because it means applications like Samba won't
>> be able to use io-uring at all.
>>
>> As even if current_cred() and ctx->creds describe the same
>> set of uid,gids the != won't ever match again and
>> makes the whole ring unuseable.
>>
>> I'm not sure about what the best short term solution could be...
>>
>> 1. May just doing the check for path based operations?
>>   and fail individual requests with EPERM.
>>
>> 2. Or force REQ_F_FORCE_ASYNC for path based operations,
>>   so that they're always executed from within the workqueue
>>   with were ctx->creds is active.
>>
>> 3. Or (as proposed earlier) do the override_creds/revert_creds dance
>>   (and similar for mm) if needed.
>>
>> To summaries the problem again:
>>
>> For path based operations like:
>> - IORING_OP_CONNECT (maybe also - IORING_OP_ACCEPT???)
>> - IORING_OP_SEND*, IORING_OP_RECV* on DGRAM sockets
>> - IORING_OP_OPENAT, IORING_OP_STATX, IORING_OP_OPENAT2
>> it's important under which current_cred they are called.
>>
>> Are IORING_OP_MADVISE, IORING_OP_FADVISE and IORING_OP_FALLOCATE
>> are only bound to the credentials of the passed fd they operate on?
>>
>> The current assumption is that the io_uring_setup() syscall captures
>> the current_cred() to ctx->cred and all operations on the ring
>> are executed under the context of ctx->cred.
>> Therefore all helper threads do the override_creds/revert_creds dance.
> 
> But it doesn't - we're expecting them to match, and with this change,
> we assert that it's the case or return -EPERM.
> 
>> But the possible non-blocking line execution of operations in
>> the io_uring_enter() syscall doesn't do the override_creds/revert_creds
>> dance and execute the operations under current_cred().
>>
>> This means it's random depending on filled cached under what
>> credentials an operation is executed.
>>
>> In order to prevent security problems the current patch is enough,
>> but as outlined above it will make io-uring complete unuseable
>> for applications using any syscall that changes current_cred().
>>
>> Change 1. would be a little bit better, but still not really useful.
>>
>> I'd actually prefer solution 3. as it's still possible to make
>> use of non-blocking operations, while the security is the
>> same as solution 2.
> 
> For your situation, we need to extend it anyway, and provide a way
> to swap between personalities. So yeah it won't work as-is for your
> use case, but we can work on making that the case.

That's only for the OPENAT2 case, which we might want to use in future,
but there's a lot of work required to have async opens in Samba.

But I have a experimental module that, just use READV, WRITEV and FSYNC
with io-uring in order to avoid our userspace helper threads.

And that won't work anymore with the change as Samba change
current_cred() very often switch between (at least) 2 identities
root and the user. That will change the pointer of current_cred() each time.

I mean I could work around the check by using IORING_SETUP_SQPOLL,
but I'd like to avoid that.

metze


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 5.4 033/222] io_uring: only allow submit from owning task
  2020-01-24 19:11       ` Stefan Metzmacher
@ 2020-01-24 21:41         ` Jens Axboe
  0 siblings, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2020-01-24 21:41 UTC (permalink / raw)
  To: Stefan Metzmacher, Greg Kroah-Hartman, linux-kernel; +Cc: stable, io-uring

On 1/24/20 12:11 PM, Stefan Metzmacher wrote:
> Am 24.01.20 um 17:58 schrieb Jens Axboe:
>> On 1/24/20 3:38 AM, Stefan Metzmacher wrote:
>>> Am 22.01.20 um 10:26 schrieb Greg Kroah-Hartman:
>>>> From: Jens Axboe <[email protected]>
>>>>
>>>> commit 44d282796f81eb1debc1d7cb53245b4cb3214cb5 upstream.
>>>>
>>>> If the credentials or the mm doesn't match, don't allow the task to
>>>> submit anything on behalf of this ring. The task that owns the ring can
>>>> pass the file descriptor to another task, but we don't want to allow
>>>> that task to submit an SQE that then assumes the ring mm and creds if
>>>> it needs to go async.
>>>>
>>>> Cc: [email protected]
>>>> Suggested-by: Stefan Metzmacher <[email protected]>
>>>> Signed-off-by: Jens Axboe <[email protected]>
>>>> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>>>>
>>>>
>>>> ---
>>>>  fs/io_uring.c |    6 ++++++
>>>>  1 file changed, 6 insertions(+)
>>>>
>>>> --- a/fs/io_uring.c
>>>> +++ b/fs/io_uring.c
>>>> @@ -3716,6 +3716,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned
>>>>  			wake_up(&ctx->sqo_wait);
>>>>  		submitted = to_submit;
>>>>  	} else if (to_submit) {
>>>> +		if (current->mm != ctx->sqo_mm ||
>>>> +		    current_cred() != ctx->creds) {
>>>> +			ret = -EPERM;
>>>> +			goto out;
>>>> +		}
>>>> +
>>>
>>> I thought about this a bit more.
>>>
>>> I'm not sure if this is actually to restrictive,
>>> because it means applications like Samba won't
>>> be able to use io-uring at all.
>>>
>>> As even if current_cred() and ctx->creds describe the same
>>> set of uid,gids the != won't ever match again and
>>> makes the whole ring unuseable.
>>>
>>> I'm not sure about what the best short term solution could be...
>>>
>>> 1. May just doing the check for path based operations?
>>>   and fail individual requests with EPERM.
>>>
>>> 2. Or force REQ_F_FORCE_ASYNC for path based operations,
>>>   so that they're always executed from within the workqueue
>>>   with were ctx->creds is active.
>>>
>>> 3. Or (as proposed earlier) do the override_creds/revert_creds dance
>>>   (and similar for mm) if needed.
>>>
>>> To summaries the problem again:
>>>
>>> For path based operations like:
>>> - IORING_OP_CONNECT (maybe also - IORING_OP_ACCEPT???)
>>> - IORING_OP_SEND*, IORING_OP_RECV* on DGRAM sockets
>>> - IORING_OP_OPENAT, IORING_OP_STATX, IORING_OP_OPENAT2
>>> it's important under which current_cred they are called.
>>>
>>> Are IORING_OP_MADVISE, IORING_OP_FADVISE and IORING_OP_FALLOCATE
>>> are only bound to the credentials of the passed fd they operate on?
>>>
>>> The current assumption is that the io_uring_setup() syscall captures
>>> the current_cred() to ctx->cred and all operations on the ring
>>> are executed under the context of ctx->cred.
>>> Therefore all helper threads do the override_creds/revert_creds dance.
>>
>> But it doesn't - we're expecting them to match, and with this change,
>> we assert that it's the case or return -EPERM.
>>
>>> But the possible non-blocking line execution of operations in
>>> the io_uring_enter() syscall doesn't do the override_creds/revert_creds
>>> dance and execute the operations under current_cred().
>>>
>>> This means it's random depending on filled cached under what
>>> credentials an operation is executed.
>>>
>>> In order to prevent security problems the current patch is enough,
>>> but as outlined above it will make io-uring complete unuseable
>>> for applications using any syscall that changes current_cred().
>>>
>>> Change 1. would be a little bit better, but still not really useful.
>>>
>>> I'd actually prefer solution 3. as it's still possible to make
>>> use of non-blocking operations, while the security is the
>>> same as solution 2.
>>
>> For your situation, we need to extend it anyway, and provide a way
>> to swap between personalities. So yeah it won't work as-is for your
>> use case, but we can work on making that the case.
> 
> That's only for the OPENAT2 case, which we might want to use in future,
> but there's a lot of work required to have async opens in Samba.
> 
> But I have a experimental module that, just use READV, WRITEV and FSYNC
> with io-uring in order to avoid our userspace helper threads.
> 
> And that won't work anymore with the change as Samba change
> current_cred() very often switch between (at least) 2 identities
> root and the user. That will change the pointer of current_cred() each time.
> 
> I mean I could work around the check by using IORING_SETUP_SQPOLL,
> but I'd like to avoid that.

It's easy enough to support the current creds from io_uring_enter(),
where we need a bit of plumbing is if we have to go async for that
particular operation. We currently have that static as well, which is
why the current patch is needed.

What I'm trying to say is that'll we'll need code changes to support
this in any case, even just reverting that change isn't going to make
the problem go away for you.

Hence we just need to decide on what the best way to do this would be!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 5.4 033/222] io_uring: only allow submit from owning task
  2020-01-24 10:38   ` [PATCH 5.4 033/222] io_uring: only allow submit from owning task Stefan Metzmacher
  2020-01-24 10:41     ` Stefan Metzmacher
  2020-01-24 16:58     ` Jens Axboe
@ 2020-01-26  5:54     ` Andres Freund
  2020-01-26 16:57       ` Jens Axboe
  2 siblings, 1 reply; 7+ messages in thread
From: Andres Freund @ 2020-01-26  5:54 UTC (permalink / raw)
  To: Jens Axboe, Stefan Metzmacher
  Cc: Greg Kroah-Hartman, linux-kernel, stable, io-uring

Hi,

On 2020-01-24 11:38:02 +0100, Stefan Metzmacher wrote:
> Am 22.01.20 um 10:26 schrieb Greg Kroah-Hartman:
> > From: Jens Axboe <[email protected]>
> >
> > commit 44d282796f81eb1debc1d7cb53245b4cb3214cb5 upstream.
> >
> > If the credentials or the mm doesn't match, don't allow the task to
> > submit anything on behalf of this ring. The task that owns the ring can
> > pass the file descriptor to another task, but we don't want to allow
> > that task to submit an SQE that then assumes the ring mm and creds if
> > it needs to go async.
> >
> > Cc: [email protected]
> > Suggested-by: Stefan Metzmacher <[email protected]>
> > Signed-off-by: Jens Axboe <[email protected]>
> > Signed-off-by: Greg Kroah-Hartman <[email protected]>
> >
> >
> > ---
> >  fs/io_uring.c |    6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -3716,6 +3716,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned
> >  			wake_up(&ctx->sqo_wait);
> >  		submitted = to_submit;
> >  	} else if (to_submit) {
> > +		if (current->mm != ctx->sqo_mm ||
> > +		    current_cred() != ctx->creds) {
> > +			ret = -EPERM;
> > +			goto out;
> > +		}
> > +
>
> I thought about this a bit more.
>
> I'm not sure if this is actually to restrictive,
> because it means applications like Samba won't
> be able to use io-uring at all.

Yea, I think it is too restrictive. In fact, it broke my WIP branch to
make postgres use io_uring.

Postgres uses a forked process model, with all sub-processes forked off
one parent process ("postmaster"), sharing MAP_ANONYMOUS|MAP_SHARED
memory (buffer pool, locks, and lots of other IPC). My WIP branch so far
has postmaster create a number of io_urings that then the different
processes can use (with locking if necessary).

In plenty of the cases it's fairly important for performance to not
require an additional context switch initiate IO, therefore we cannot
delegate submitting to an io_uring to separate process. But it's not
feasible to have one (or even two) urings for each process either: For
one, that's just about guaranteed to bring us over the default
RLIMIT_MEMLOCK limit, and requiring root only config changes is not an
option for many (nor user friendly).

Not sharing queues makes it basically impossible to rely on io_uring
ordering properties when operation interlock is needed. E.g. to
guarantee that the journal is flushed before some data buffer can be
written back, being able to make use of links and drains is great - but
there's one journal for all processes. To be able to guarantee anything,
all the interlocked writes need to go through one io_uring. I've not yet
implemented this, so I don't have numbers, but I expect pretty
significant savings.

Not being able to share urings also makes it harder to resolve
deadlocks:

As we call into both library and user defined code, we cannot guarantee
that a specific backend process will promptly (or at all, when waiting
for some locks) process cqes. There's also sections where we don't want
to constantly check for ready events, for performance reasons.  But
operations initiated by a process might be blocking other connections:

E.g. process #1 might have initiated transferring a number of blocks
into postgres' buffer pool via io_uring , and now is busy processing the
first block that completed. But now process #2 might need one of the
buffers that had IO queued, but didn't complete in time for #1 to see
the results.  The way I have it set up right now, #2 simply can process
pending cqes in the relevant queue. Which, in this example, would mark
the pg buffer pool entry as valid, allowing #2 to continue.

Now, completions can still be read by all processes, so I could continue
to do the above: But that'd require all potentially needed io_urings to
be set up in postmaster, before the first fork, and all processes to
keep all those FDs open (commonly several hundred). Not an attractive
option either, imo.

Obviously we could solve this by having a sqe result processing thread
running within each process - but that'd be a very significant new
overhead. And it'd require making some non-threadsafe code threadsafe,
which I do not relish tackling as a side-effect of io_uring adoption.

It also turns out to be nice from a performance / context-switch rate
angle to be able to process cqes for submissions by other
processes. Saves an expensive context switch, and often enough it really
doesn't matter which process processes the completion (especially for
readahead). And in other cases it's cheap to just schedule the
subsequent work from the cqe processor, e.g. initiating readahead of a
few more blocks into the pg buffer pool.  Similarly, there are a few
cases where it's useful for several processes to submit IO into a uring
primarily drained by one specific process, to offload the subsequent
action, if that's expensive

Now, I think there's a valid argument to be made that postgres should
just use threads, and not be hampered by any of this. But a) that's not
going to happen all that soon, it's a large change, b) it's far from
free from problems either, especially scalability on larger machines,
and robustness.

> As even if current_cred() and ctx->creds describe the same
> set of uid,gids the != won't ever match again and
> makes the whole ring unuseable.

Indeed.  It also seems weird that a sqpoll now basically has different
semantics, allowing the io_uring to be used by multiple processes - a
task with a different mm can still wake the sqpoll thread up, even.

Since the different processes attached still can still write to the
io_uring mmaped memory, they can still queue sqes, they just can't
initiate the processing. But the next time the creator of the uring
submits, they will still be - and thus it seems that the kernel needs to
handle this safely. So I really don't get what this actually achieves?
Am I missing something here?

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 5.4 033/222] io_uring: only allow submit from owning task
  2020-01-26  5:54     ` Andres Freund
@ 2020-01-26 16:57       ` Jens Axboe
  0 siblings, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2020-01-26 16:57 UTC (permalink / raw)
  To: Andres Freund, Stefan Metzmacher
  Cc: Greg Kroah-Hartman, linux-kernel, stable, io-uring

On 1/25/20 10:54 PM, Andres Freund wrote:
> Hi,
> 
> On 2020-01-24 11:38:02 +0100, Stefan Metzmacher wrote:
>> Am 22.01.20 um 10:26 schrieb Greg Kroah-Hartman:
>>> From: Jens Axboe <[email protected]>
>>>
>>> commit 44d282796f81eb1debc1d7cb53245b4cb3214cb5 upstream.
>>>
>>> If the credentials or the mm doesn't match, don't allow the task to
>>> submit anything on behalf of this ring. The task that owns the ring can
>>> pass the file descriptor to another task, but we don't want to allow
>>> that task to submit an SQE that then assumes the ring mm and creds if
>>> it needs to go async.
>>>
>>> Cc: [email protected]
>>> Suggested-by: Stefan Metzmacher <[email protected]>
>>> Signed-off-by: Jens Axboe <[email protected]>
>>> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>>>
>>>
>>> ---
>>>  fs/io_uring.c |    6 ++++++
>>>  1 file changed, 6 insertions(+)
>>>
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -3716,6 +3716,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned
>>>  			wake_up(&ctx->sqo_wait);
>>>  		submitted = to_submit;
>>>  	} else if (to_submit) {
>>> +		if (current->mm != ctx->sqo_mm ||
>>> +		    current_cred() != ctx->creds) {
>>> +			ret = -EPERM;
>>> +			goto out;
>>> +		}
>>> +
>>
>> I thought about this a bit more.
>>
>> I'm not sure if this is actually to restrictive,
>> because it means applications like Samba won't
>> be able to use io-uring at all.
> 
> Yea, I think it is too restrictive. In fact, it broke my WIP branch to
> make postgres use io_uring.
> 
> 
> Postgres uses a forked process model, with all sub-processes forked off
> one parent process ("postmaster"), sharing MAP_ANONYMOUS|MAP_SHARED
> memory (buffer pool, locks, and lots of other IPC). My WIP branch so far
> has postmaster create a number of io_urings that then the different
> processes can use (with locking if necessary).
> 
> In plenty of the cases it's fairly important for performance to not
> require an additional context switch initiate IO, therefore we cannot
> delegate submitting to an io_uring to separate process. But it's not
> feasible to have one (or even two) urings for each process either: For
> one, that's just about guaranteed to bring us over the default
> RLIMIT_MEMLOCK limit, and requiring root only config changes is not an
> option for many (nor user friendly).
> 
> 
> Not sharing queues makes it basically impossible to rely on io_uring
> ordering properties when operation interlock is needed. E.g. to
> guarantee that the journal is flushed before some data buffer can be
> written back, being able to make use of links and drains is great - but
> there's one journal for all processes. To be able to guarantee anything,
> all the interlocked writes need to go through one io_uring. I've not yet
> implemented this, so I don't have numbers, but I expect pretty
> significant savings.
> 
> 
> Not being able to share urings also makes it harder to resolve
> deadlocks:
> 
> As we call into both library and user defined code, we cannot guarantee
> that a specific backend process will promptly (or at all, when waiting
> for some locks) process cqes. There's also sections where we don't want
> to constantly check for ready events, for performance reasons.  But
> operations initiated by a process might be blocking other connections:
> 
> E.g. process #1 might have initiated transferring a number of blocks
> into postgres' buffer pool via io_uring , and now is busy processing the
> first block that completed. But now process #2 might need one of the
> buffers that had IO queued, but didn't complete in time for #1 to see
> the results.  The way I have it set up right now, #2 simply can process
> pending cqes in the relevant queue. Which, in this example, would mark
> the pg buffer pool entry as valid, allowing #2 to continue.
> 
> Now, completions can still be read by all processes, so I could continue
> to do the above: But that'd require all potentially needed io_urings to
> be set up in postmaster, before the first fork, and all processes to
> keep all those FDs open (commonly several hundred). Not an attractive
> option either, imo.
> 
> Obviously we could solve this by having a sqe result processing thread
> running within each process - but that'd be a very significant new
> overhead. And it'd require making some non-threadsafe code threadsafe,
> which I do not relish tackling as a side-effect of io_uring adoption.
> 
> 
> It also turns out to be nice from a performance / context-switch rate
> angle to be able to process cqes for submissions by other
> processes. Saves an expensive context switch, and often enough it really
> doesn't matter which process processes the completion (especially for
> readahead). And in other cases it's cheap to just schedule the
> subsequent work from the cqe processor, e.g. initiating readahead of a
> few more blocks into the pg buffer pool.  Similarly, there are a few
> cases where it's useful for several processes to submit IO into a uring
> primarily drained by one specific process, to offload the subsequent
> action, if that's expensive
> 
> 
> Now, I think there's a valid argument to be made that postgres should
> just use threads, and not be hampered by any of this. But a) that's not
> going to happen all that soon, it's a large change, b) it's far from
> free from problems either, especially scalability on larger machines,
> and robustness.
> 
> 
>> As even if current_cred() and ctx->creds describe the same
>> set of uid,gids the != won't ever match again and
>> makes the whole ring unuseable.
> 
> Indeed.  It also seems weird that a sqpoll now basically has different
> semantics, allowing the io_uring to be used by multiple processes - a
> task with a different mm can still wake the sqpoll thread up, even.
> 
> Since the different processes attached still can still write to the
> io_uring mmaped memory, they can still queue sqes, they just can't
> initiate the processing. But the next time the creator of the uring
> submits, they will still be - and thus it seems that the kernel needs to
> handle this safely. So I really don't get what this actually achieves?
> Am I missing something here?

Thanks for your detailed reported, I'm going to revert this change for
5.5.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-01-26 16:57 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <[email protected]>
     [not found] ` <[email protected]>
2020-01-24 10:38   ` [PATCH 5.4 033/222] io_uring: only allow submit from owning task Stefan Metzmacher
2020-01-24 10:41     ` Stefan Metzmacher
2020-01-24 16:58     ` Jens Axboe
2020-01-24 19:11       ` Stefan Metzmacher
2020-01-24 21:41         ` Jens Axboe
2020-01-26  5:54     ` Andres Freund
2020-01-26 16:57       ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox