public inbox for [email protected]
 help / color / mirror / Atom feed
From: Jens Axboe <[email protected]>
To: Andres Freund <[email protected]>, Stefan Metzmacher <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>,
	[email protected], [email protected],
	io-uring <[email protected]>
Subject: Re: [PATCH 5.4 033/222] io_uring: only allow submit from owning task
Date: Sun, 26 Jan 2020 09:57:29 -0700	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>

On 1/25/20 10:54 PM, Andres Freund wrote:
> Hi,
> 
> On 2020-01-24 11:38:02 +0100, Stefan Metzmacher wrote:
>> Am 22.01.20 um 10:26 schrieb Greg Kroah-Hartman:
>>> From: Jens Axboe <[email protected]>
>>>
>>> commit 44d282796f81eb1debc1d7cb53245b4cb3214cb5 upstream.
>>>
>>> If the credentials or the mm doesn't match, don't allow the task to
>>> submit anything on behalf of this ring. The task that owns the ring can
>>> pass the file descriptor to another task, but we don't want to allow
>>> that task to submit an SQE that then assumes the ring mm and creds if
>>> it needs to go async.
>>>
>>> Cc: [email protected]
>>> Suggested-by: Stefan Metzmacher <[email protected]>
>>> Signed-off-by: Jens Axboe <[email protected]>
>>> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>>>
>>>
>>> ---
>>>  fs/io_uring.c |    6 ++++++
>>>  1 file changed, 6 insertions(+)
>>>
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -3716,6 +3716,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned
>>>  			wake_up(&ctx->sqo_wait);
>>>  		submitted = to_submit;
>>>  	} else if (to_submit) {
>>> +		if (current->mm != ctx->sqo_mm ||
>>> +		    current_cred() != ctx->creds) {
>>> +			ret = -EPERM;
>>> +			goto out;
>>> +		}
>>> +
>>
>> I thought about this a bit more.
>>
>> I'm not sure if this is actually to restrictive,
>> because it means applications like Samba won't
>> be able to use io-uring at all.
> 
> Yea, I think it is too restrictive. In fact, it broke my WIP branch to
> make postgres use io_uring.
> 
> 
> Postgres uses a forked process model, with all sub-processes forked off
> one parent process ("postmaster"), sharing MAP_ANONYMOUS|MAP_SHARED
> memory (buffer pool, locks, and lots of other IPC). My WIP branch so far
> has postmaster create a number of io_urings that then the different
> processes can use (with locking if necessary).
> 
> In plenty of the cases it's fairly important for performance to not
> require an additional context switch initiate IO, therefore we cannot
> delegate submitting to an io_uring to separate process. But it's not
> feasible to have one (or even two) urings for each process either: For
> one, that's just about guaranteed to bring us over the default
> RLIMIT_MEMLOCK limit, and requiring root only config changes is not an
> option for many (nor user friendly).
> 
> 
> Not sharing queues makes it basically impossible to rely on io_uring
> ordering properties when operation interlock is needed. E.g. to
> guarantee that the journal is flushed before some data buffer can be
> written back, being able to make use of links and drains is great - but
> there's one journal for all processes. To be able to guarantee anything,
> all the interlocked writes need to go through one io_uring. I've not yet
> implemented this, so I don't have numbers, but I expect pretty
> significant savings.
> 
> 
> Not being able to share urings also makes it harder to resolve
> deadlocks:
> 
> As we call into both library and user defined code, we cannot guarantee
> that a specific backend process will promptly (or at all, when waiting
> for some locks) process cqes. There's also sections where we don't want
> to constantly check for ready events, for performance reasons.  But
> operations initiated by a process might be blocking other connections:
> 
> E.g. process #1 might have initiated transferring a number of blocks
> into postgres' buffer pool via io_uring , and now is busy processing the
> first block that completed. But now process #2 might need one of the
> buffers that had IO queued, but didn't complete in time for #1 to see
> the results.  The way I have it set up right now, #2 simply can process
> pending cqes in the relevant queue. Which, in this example, would mark
> the pg buffer pool entry as valid, allowing #2 to continue.
> 
> Now, completions can still be read by all processes, so I could continue
> to do the above: But that'd require all potentially needed io_urings to
> be set up in postmaster, before the first fork, and all processes to
> keep all those FDs open (commonly several hundred). Not an attractive
> option either, imo.
> 
> Obviously we could solve this by having a sqe result processing thread
> running within each process - but that'd be a very significant new
> overhead. And it'd require making some non-threadsafe code threadsafe,
> which I do not relish tackling as a side-effect of io_uring adoption.
> 
> 
> It also turns out to be nice from a performance / context-switch rate
> angle to be able to process cqes for submissions by other
> processes. Saves an expensive context switch, and often enough it really
> doesn't matter which process processes the completion (especially for
> readahead). And in other cases it's cheap to just schedule the
> subsequent work from the cqe processor, e.g. initiating readahead of a
> few more blocks into the pg buffer pool.  Similarly, there are a few
> cases where it's useful for several processes to submit IO into a uring
> primarily drained by one specific process, to offload the subsequent
> action, if that's expensive
> 
> 
> Now, I think there's a valid argument to be made that postgres should
> just use threads, and not be hampered by any of this. But a) that's not
> going to happen all that soon, it's a large change, b) it's far from
> free from problems either, especially scalability on larger machines,
> and robustness.
> 
> 
>> As even if current_cred() and ctx->creds describe the same
>> set of uid,gids the != won't ever match again and
>> makes the whole ring unuseable.
> 
> Indeed.  It also seems weird that a sqpoll now basically has different
> semantics, allowing the io_uring to be used by multiple processes - a
> task with a different mm can still wake the sqpoll thread up, even.
> 
> Since the different processes attached still can still write to the
> io_uring mmaped memory, they can still queue sqes, they just can't
> initiate the processing. But the next time the creator of the uring
> submits, they will still be - and thus it seems that the kernel needs to
> handle this safely. So I really don't get what this actually achieves?
> Am I missing something here?

Thanks for your detailed reported, I'm going to revert this change for
5.5.

-- 
Jens Axboe


      reply	other threads:[~2020-01-26 16:57 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <[email protected]>
     [not found] ` <[email protected]>
2020-01-24 10:38   ` [PATCH 5.4 033/222] io_uring: only allow submit from owning task Stefan Metzmacher
2020-01-24 10:41     ` Stefan Metzmacher
2020-01-24 16:58     ` Jens Axboe
2020-01-24 19:11       ` Stefan Metzmacher
2020-01-24 21:41         ` Jens Axboe
2020-01-26  5:54     ` Andres Freund
2020-01-26 16:57       ` Jens Axboe [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox